Methods and systems for dynamic variant thresholding in a liquid biopsy assay

ABSTRACT

Methods, systems, and software are provided for validating a somatic sequence variant in a subject having a cancer condition. Sequence reads are obtained from sequencing cell-free DNA fragments in a liquid biopsy sample of the subject. Sequence reads are aligned to a reference sequence. A variant allele fragment count and locus fragment count are identified for a candidate variant that maps to a locus in the reference sequence. The variant allele fragment count is compared against a dynamic variant count threshold for the locus. The threshold is based on a pre-test odds of a positive variant call for the locus, based on the prevalence of variants in a genomic region including the locus in a cohort of subjects having the cancer condition. The somatic sequence variant in the subject is validated, or rejected, when the variant allele fragment count for the candidate variant satisfies, or does not satisfy, the threshold.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/179,086, filed Feb. 18, 2021, which claims priority to U.S. Provisional Patent Application No. 63/041,293, filed Jun. 19, 2020, and U.S. Provisional Patent Application No. 62/978,130, filed Feb. 18, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.

FIELD OF THE INVENTION

The present disclosure relates generally to the use of cell-free DNA sequencing data to provide clinical support for personalized treatment of cancer.

BACKGROUND

Precision oncology is the practice of tailoring cancer therapy to the unique genomic, epigenetic, and/or transcriptomic profile of an individual's cancer. Personalized cancer treatment builds upon conventional therapeutic regimens used to treat cancer based only on the gross classification of the cancer, e.g., treating all breast cancer patients with a first therapy and all lung cancer patients with a second therapy. This field was born out of many observations that different patients diagnosed with the same type of cancer, e.g., breast cancer, responded very differently to common treatment regimens. Over time, researchers have identified genomic, epigenetic, and transcriptomic markers that improve predictions as to how an individual cancer will respond to a particular treatment modality.

There is growing evidence that cancer patients who receive therapy guided by their genetics have better outcomes. For example, studies have shown that targeted therapies result in significantly improved progression-free cancer survival. See, e.g., Radovich M. et al., Oncotarget, 7(35):56491-500 (2016). Similarly, reports from the IMPACT trial, a large (n=1307) retrospective analysis of consecutive, prospectively molecularly profiled patients with advanced cancer who participated in a large, personalized medicine trial, indicate that patients receiving targeted therapies matched to their tumor biology had a response rate of 16.2%, as opposed to a response rate of 5.2% for patients receiving non-matched therapy. Tsimberidou A M et al., ASCO 2018, Abstract LBA2553 (2018).

In fact, therapy targeted to specific genomic alterations is already the standard of care in several tumor types, e.g., as suggested in the National Comprehensive Cancer Network (NCCN) guidelines for melanoma, colorectal cancer, and non-small cell lung cancer. In practice, implementation of these targeted therapies requires determining the status of the diagnostic marker in each eligible cancer patient. While this can be accomplished for the few, well known mutations associated with treatment recommendations in the NCCN guidelines using individual assays or small next generation sequencing (NGS) panels, the growing number of actionable genomic alterations and increasing complexity of diagnostic classifiers necessitate a more comprehensive evaluation of each patient's cancer genome, epigenome, and/or transcriptome.

For instance, some evidence suggests that use of combination therapies where each component is matched to an actionable genomic alteration holds the greatest potential for treating individual cancers. To this point, a retrospective study of cancer patients treated with one or more therapeutic regimens revealed that patients who received therapies matched to a higher percentage of their genomic alterations experienced a greater frequency of stable disease (e.g., a longer time to recurrence), longer time to treatment failure, and greater overall survival. Wheler J J et al., Cancer Res., 76(13):3690-701 (2016). Thus, comprehensive evaluation of each cancer patient's genome, epigenome, and/or transcriptome should maximize the benefits provided by precision oncology, by facilitating more fine-tuned combination therapies, use of novel off-label drug indications, and/or tissue agnostic immunotherapy. See, for example, Schwaederle M. et al., J Clin Oncol., 33(32):3817-25 (2015); Schwaederle M. et al., JAMA Oncol., 2(11):1452-59 (2016); and Wheler J J et al., Cancer Res., 76(13):3690-701 (2016). Further, the use of comprehensive next generation sequencing analysis of cancer genomes facilitates better access and a larger patient pool for clinical trial enrollment. Coyne G O et al., Curr. Probl. Cancer, 41(3):182-93 (2017); and Markman M., Oncology, 31(3):158, 168.

The use of large NGS genomic analysis is growing in order to address the need for more comprehensive characterization of an individual's cancer genome. See, for example, Fernandes G S et al., Clinics, 72(10):588-94. Recent studies indicate that of the patients for whom large NGS genomic analysis is performed, 30-40% then receive clinical care based on the assay results, which is limited by at least the identification of actionable genomic alterations, the availability of medication for treatment of identified actionable genomic alterations, and the clinical condition of the subject. See, Ross J S et al., JAMA Oncol., 1(1):40-49 (2015); Ross J S et al., Arch. Pathol. Lab Med., 139:642-49 (2015); Hirshfield K M et al., Oncologist, 21(11):1315-25 (2016); and Groisberg R. et al., Oncotarget, 8:39254-67 (2017).

However, these large NGS genomic analyses are conventionally performed on solid tumor samples. For instance, each of the studies referenced in the paragraph above performed NGS analysis of FFPE tumor blocks from patients. Solid tissue biopsies remain the gold standard for diagnosis and identification of predictive biomarkers because they represent well-known and validated methodologies that provide a high degree of accuracy. Nevertheless, there are significant limitations to the use of solid tissue material for large NGS genomic analyses of cancers. For example, tumor biopsies are subject to sampling bias caused by spatial and/or temporal genetic heterogeneity, e.g., between two regions of a single tumor and/or between different cancerous tissues (such as between primary and metastatic tumor sites or between two different primary tumor sites). Such intertumor or intratumor heterogeneity can cause sub-clonal or emerging mutations to be overlooked when using localized tissue biopsies, with the potential for sampling bias to be exacerbated over time as sub-clonal populations further evolve and/or shift in predominance.

Additionally, the acquisition of solid tissue biopsies often requires invasive surgical procedures, e.g., when the primary tumor site is located at an internal organ. These procedures can be expensive, time consuming, and carry a significant risk to the patient, e.g., when the patient is in poor health and may not be able to tolerate invasive medical procedures and/or the tumor is located in a particularly sensitive or inoperable location, such as in the brain or heart. Further, the amount of tissue, if any, that can be procured depends on multiple factors, including the location of the tumor, the size of the tumor, the fragility of the patient, and the risk of comorbidities related to biopsies, such as bleeding and infections. For instance, recent studies report that tissue samples in a majority of advanced non-small cell lung cancer patients are limited to small biopsies and cannot be obtained at all in up to 31% of patients. Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016). Even when a tissue biopsy is obtained, the sample may be too scant for comprehensive testing.

Further, the method of tissue collection, preservation (e.g., formalin fixation), and/or storage of tissue biopsies can result in sample degradation and variable quality DNA. This, in turn, leads to inaccuracies in downstream assays and analysis, including next-generation sequencing (NGS) for the identification of biomarkers. Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016).

In addition, the invasive nature of the biopsy procedure, the time and cost associated with obtaining the sample, and the compromised state of cancer patients receiving therapy render repeat testing of cancerous tissues impracticable, if not impossible. As a result, solid tissue biopsy analysis is not amenable to many monitoring schemes that would benefit cancer patients, such as disease progression analysis, treatment efficacy evaluation, disease recurrence monitoring, and other techniques that require data from several time points.

Cell-free DNA (cfDNA) has been identified in various bodily fluids, e.g., blood serum, plasma, urine, etc. Chan et al., Ann. Clin. Biochem., 40 (Pt 2):122-30 (2003). This cfDNA originates from necrotic or apoptotic cells of all types, including germline cells, hematopoietic cells, and diseased (e.g., cancerous) cells. Advantageously, genomic alterations in cancerous tissues can be identified from cfDNA isolated from cancer patients. See, e.g., Stroun et al., Oncology, 46(5):318-22 (1989); Goessl et al., Cancer Res., 60(21):5941-45 (2000); and Frenel et al., Clin. Cancer Res., 21(20):4586-96 (2015). Thus, one approach to overcoming the problems presented by the use of solid tissue biopsies described above is to analyze cell-free nucleic acids (e.g., cfDNA) and/or nucleic acids in circulating tumor cells present in biological fluids, e.g., via a liquid biopsy.

Specifically, liquid biopsies offer several advantages over conventional solid tissue biopsy analysis. For instance, because bodily fluids can be collected in a minimally invasive or non-invasive fashion, sample collection is simpler, faster, safer, and less expensive than solid tumor biopsies. Such methods require only small amounts of sample (e.g., 10 mL or less of whole blood per biopsy) and reduce the discomfort and risk of complications experienced by patients during conventional tissue biopsies. In fact, liquid biopsy samples can be collected with limited or no assistance from medical professionals, and collection can be performed at almost any location. Further, liquid biopsy samples can be collected from any patient, regardless of the location of their cancer, their overall health, and any previous biopsy collection. This allows for analysis of the cancer genome of patients from whom a solid tumor sample cannot be easily and/or safely obtained. In addition, because cell-free DNA in the bodily fluids arises from many different types of tissues in the patient, the genomic alterations present in the pool of cell-free DNA are representative of various different clonal sub-populations of the cancerous tissue of the subject, facilitating a more comprehensive analysis of the cancerous genome of the subject than is possible from one or more sections of a single solid tumor sample.

Liquid biopsies also enable serial genetic testing prior to cancer detection, during the early stages of cancer progression, throughout the course of treatment, and during remission, e.g., to monitor for disease recurrence. The ability to conduct serial testing via non-invasive liquid biopsies throughout the course of disease could prove beneficial for many patients, e.g., through monitoring patient response to therapies, the emergence of new actionable genomic alterations, and/or drug-resistance alterations. These types of information allow medical professionals to more quickly tailor and update therapeutic regimens, e.g., facilitating more timely intervention in the case of disease progression. See, e.g., Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016).

Nevertheless, while liquid biopsies are promising tools for improving outcomes using precision oncology, there are significant challenges specific to the use of cell-free DNA for evaluation of a subject's cancer genome. For instance, there is a highly variable signal-to-noise ratio from one liquid biopsy sample to the next. This occurs because cfDNA originates from a variety of different cells in a subject, both healthy and diseased. Depending on the stage and type of cancer in any particular subject, the fraction of cfDNA fragments originating from cancerous cells (the “tumor fraction” or “ctDNA fraction” of the sample/subject) can range from almost 0% to well over 50%. Other factors, including tumor type and mutation profile, can also impact the amount of DNA released from cancerous tissues. In addition, cfDNA clearance through the liver and kidneys is affected by a variety of factors, including renal dysfunction or other tissue damaging factors (e.g., chemotherapy, surgery, and/or radiotherapy).

This, in turn, leads to problems detecting and/or validating cancer-specific genomic alterations in a liquid sample. This is particularly true during early stages of the disease, when cancer therapies have much higher success rates, because the tumor fraction in the patient is lowest at this point. Thus, early stage cancer patients can have ctDNA fractions below the limit of detection (LOD) for one or more informative genomic alterations, limiting clinical utility because of the risk of false negatives and/or providing an incomplete picture of the cancer genome of the patient. Further, because cancers, and even individual tumors, can be clonally diverse, actionable genomic alterations that arise in only a subset of clonal populations are diluted below the overall tumor fraction of the sample, further frustrating attempts to tailor combination therapies to the various actionable mutations in the patient's cancer genome. Consequently, most studies using liquid biopsy samples to date have focused on late stage patients for assay validation and research.

Another challenge associated with liquid biopsies is the accurate determination of tumor fraction in a sample. This difficulty arises from at least the heterogeneity of cancers and the increased frequency of large chromosomal duplications and deletions found in cancers. As a result, the frequency of genomic alterations from cancerous tissues varies from locus to locus based on at least (i) their prevalence in different sub-clonal populations of the subject's cancer, and (ii) their location within the genome, relative to large chromosomal copy number variations. The difficulty in accurately determining the tumor fraction of liquid biopsy samples affects accurate measurement of various cancer features shown to have diagnostic value for the analysis of solid tumor biopsies. These include allelic ratios, copy number variations, overall mutational burden, frequency of abnormal methylation patterns, etc., all of which are correlated with the percentage of DNA fragments that arise from cancerous tissue, as opposed to healthy tissue.

Altogether, these factors result in highly variable concentrations of ctDNA, from patient to patient and possibly from locus to locus, that confound accurate measurement of disease indicators and actionable genomic alterations. Further, the quantity and quality of cfDNA obtained from liquid biopsy samples are highly dependent on the particular methodology for collecting the samples, storing the samples, sequencing the samples, and standardizing the sequencing data.

While validation studies of existing liquid biopsy assays have shown high sensitivity and specificity, few studies have corroborated results with orthogonal methods, or between particular testing platforms, e.g., different NGS technologies and/or targeted panel sequencing versus whole genome/exome sequencing. Reports of liquid biopsy-based studies are limited by comparison to non-comprehensive tissue testing algorithms including Sanger sequencing, small NGS hotspot panels, polymerase chain reaction (PCR), and fluorescence in situ hybridization (FISH), which may not contain all NCCN guideline genes in their reportable range, thus suffering in comparison to a more comprehensive liquid biopsy assay.

As an example, conventional liquid biopsy assays do not provide a method for accurately detecting variants (e.g., variant alleles) in ctDNA NGS assays. As described above, many patients may not have abundant ctDNA in early stage disease and may shed variants below the limit of detection (LOD) for ctDNA assays, resulting in false negatives. Detecting these variants at low circulating fractions is also technically challenging due to constraints of sequencing by synthesis. Additionally, differentiating between germline and somatic variants in ctDNA is difficult, as is differentiating between mutations derived from clonal hematopoiesis (CH) and the solid tumor being assayed. In such cases, mutations in hematopoietic lineage cells may be mistaken for tumor-derived mutations. Indeed, researchers have identified several genes frequently mutated in CH with potential importance in cancer, such as JAK2, TP53, GNAS, IDH2, and KRAS. Mayrhofer et al., 2018, “Cell-free DNA profiling of metastatic prostate cancer reveals microsatellite instability, structural rearrangements and clonal hematopoiesis,” Genome Med, (10), pg. 85; Hu et al., 2018, “False-Positive Plasma Genotyping Due to Clonal Hematopoiesis,” Clin Cancer Res, (24), pg. 4437.

The information disclosed in this Background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

SUMMARY

Given the above background, there is a need in the art for improved methods and systems for supporting clinical decisions in precision oncology using liquid biopsy assays. In particular, there is a need in the art for improved methods and systems for identifying somatic tumor mutations in cell-free DNA, particularly where the sample has a low tumor fraction. Advantageously, the present disclosure solves this and other needs in the art by providing improved somatic variant identification methodology that better accounts for locus-specific and/or sample-specific considerations to more accurately identify true somatic mutations in a liquid biopsy sample. For example, by using an application of Bayes' theorem to account for one or more of (i) the prevalence of variants at a specific locus in a specific cancer type, (ii) the variant allele fraction for the variant being evaluated, (iii) the prevalence of sequencing errors at a particular locus, and (iv) the actual sequencing error rate of a particular reaction, the variant filter methodologies described herein tune the specificity and sensitivity of variant count thresholds in a locus-specific fashion to achieve higher accuracy of true somatic variant calling in a liquid biopsy assay.

For example, in one aspect, the present disclosure provides a method of validating a somatic sequence variant in a test subject having a cancer condition. The method is performed at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors.

The method includes obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thus obtaining a first plurality of sequence reads. Each respective sequence read in the first plurality of sequence reads is aligned to a reference sequence for the species of the subject, thus identifying a variant allele fragment count for a candidate variant that maps to a locus in the reference sequence, and a locus fragment count for the locus encompassing the candidate variant.

The method further includes comparing the variant allele fragment count for the candidate variant against a dynamic variant count threshold for the locus in the reference sequence that the candidate variant maps to. The dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based on the prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition.

The method then includes rejecting or validating the variant as a true somatic variant based upon the dynamic variant count threshold. For instance, when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, the presence of the somatic sequence variant in the test subject is validated. And when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus, the presence of the somatic sequence variant in the test subject is rejected.

Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A, 1B, 1C, 1D, 1E, and 1F collectively illustrate a block diagram of an example computing device for validating a somatic sequence variant in a test subject having a cancer condition, in accordance with some embodiments of the present disclosure.

FIG. 2A illustrates an example workflow for generating a clinical report based on information generated from analysis of one or more patient specimens, in accordance with some embodiments of the present disclosure.

FIG. 2B illustrates an example of a distributed diagnostic environment for collecting and evaluating patient data for the purpose of precision oncology, in accordance with some embodiments of the present disclosure.

FIG. 3 provides an example flow chart of processes and features for liquid biopsy sample collection and analysis for use in precision oncology, in accordance with some embodiments of the present disclosure.

FIGS. 4A, 4B, 4C, 4D, 4E, and 4F collectively illustrate an example bioinformatics pipeline for precision oncology, in accordance with various embodiments of the present disclosure. FIG. 4A provides an overview flow chart of processes and features in a bioinformatics pipeline, in accordance with some embodiments of the present disclosure. FIG. 4B provides an overview of a bioinformatics pipeline executed with either a liquid biopsy sample alone or a liquid biopsy sample and a matched normal sample. FIG. 4C illustrates that paired end reads from tumor and normal isolates are zipped and stored separately under the same order identifier, in accordance with some embodiments of the present disclosure. FIG. 4D illustrates quality correction for FASTQ files, in accordance with some embodiments of the present disclosure. FIG. 4E illustrates processes for obtaining tumor and normal BAM alignment files, in accordance with some embodiments of the present disclosure. FIG. 4F provides a flow chart of a method for validating a somatic sequence variant in a test subject having a cancer condition, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.

FIGS. 4G1, 4G2, and 4G3 collectively illustrate an example bioinformatics pipeline for precision oncology, in accordance with various embodiments of the present disclosure. Specifically, these figures provide a flow chart of a method for validating a somatic sequence variant in a test subject having a cancer condition, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.

FIGS. 5A and 5B collectively provide a flow chart of processes and features for validating a somatic sequence variant in a test subject having a cancer condition, in which dashed boxes represent optional portions of the method, in accordance with some embodiments of the present disclosure.

FIG. 6 illustrates a flow chart of a method for obtaining a distribution of variant detection sensitivities as a function of circulating variant allele fraction from a cohort of subjects, in accordance with some embodiments of the present disclosure.

FIGS. 7A and 7B collectively illustrate a method of inferring an effect of a sequence variant as a gain-of-function or a loss-of-function of a gene, in accordance with some embodiments of the present disclosure.

FIGS. 8A, 8B, 8C, and 8D collectively illustrate results of an inter-assay comparison between a liquid biopsy assay, a digital droplet polymerase chain reaction (ddPCR), and a solid-tumor biopsy assay, in accordance with various embodiments of the present disclosure.

FIGS. 9A, 9B, 9C, 9D, 9E, 9F, 9G, and 9H collectively illustrate results of a comparison between circulating tumor fraction estimate (ctFE) and variant allele fraction (VAF) using an Off-Target Tumor Estimation Routine (OTTER) method, in accordance with various embodiments of the present disclosure.

FIGS. 10A and 10B collectively illustrate results of evaluating ctFE and mutational landscape according to cancer type, in accordance with various embodiments of the present disclosure.

FIGS. 11A, 11B, and 11C collectively illustrate results of evaluating associations between ctFE and advanced disease states, in accordance with various embodiments of the present disclosure.

FIGS. 12A, 12B, and 12C collectively illustrate results of comparing ctFE with recent clinical response outcomes, in accordance with various embodiments of the present disclosure.

FIG. 13 illustrates a first table describing sensitivity for all SNVs, indels, CNVs, and rearrangements targeted in reference samples, in accordance with various embodiments of the present disclosure.

FIG. 14 illustrates a second table describing sensitivity for all SNVs, indels, CNVs, and rearrangements targeted in reference samples, in accordance with various embodiments of the present disclosure.

FIG. 15 illustrates a third table describing comparisons between the presently disclosed liquid biopsy assay and a commercial liquid biopsy kit, in accordance with various embodiments of the present disclosure.

FIGS. 16A, 16B, and 16C collectively illustrate a fourth table describing variants detected by a liquid biopsy assay, in accordance with various embodiments of the present disclosure.

FIG. 17 illustrates a fifth table describing dynamic filtering methodology to further reduce discordance, in accordance with various embodiments of the present disclosure.

FIG. 18 illustrates a sixth table describing cancer groups included in clinical profiling analysis, in accordance with various embodiments of the present disclosure.

Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

Introduction

As described above, conventional liquid biopsy assays do not provide accurate determination of variants (e.g., somatic variants), particularly at low circulating variant fractions. This is due, in large part, to the use of static variant count filters that require a common amount of support to call a variant positively as a somatic variant in sequencing data, regardless of the identity of the variant and its position within the genome. That is, conventional methods require that at least X number of unique sequence reads (e.g., 8 sequence reads) provide support for (e.g., encompass) a particular variant in order for that variant to be confirmed as a true somatic variant. While this may be fine for liquid biopsy samples having a high tumor fraction, where more copies of each somatic variant would be expected to be found, it results in a high number of false negatives when samples with lower tumor fractions are analyzed. On the other hand, simply lowering the threshold to allow calling of variants with lower support will increase the number of false positives, that is, the number of positive somatic variant calls that are actually sequencing errors.

While there are many methods of performing noise suppression on ultra-high depth sequencing data commonly generated for liquid biopsy assays, there remains the fundamental fidelity boundary of sequencing by synthesis that cannot be overcome. Along with this, there are a variety of complexities and non-linearities in the ability to map reads across complex sets of genomic features and, from these data, successfully call a variant. While it is possible to filter very stringently, one of the goals of liquid biopsy assays is to detect alterations at very low circulating fractions. This requires that low levels of support be sufficient to make a positive alteration call, given that at a 0.1% circulating fraction and an average depth of 5000×, only about 5 reads containing alternate alleles are expected to be present. Because of this, no single, fixed set of thresholds can be used to filter variants: any such filter will be either too stringent or too permissive, depending on the variant context and local sequence-specific error generation models.
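The arithmetic behind this example can be checked with a short calculation. The sketch below is illustrative only: it assumes that each fragment covering a locus independently carries the variant with probability equal to the circulating fraction (a simple binomial model that the disclosure does not prescribe), and the function names are hypothetical.

```python
from math import comb


def expected_alt_reads(vaf: float, depth: int) -> float:
    """Expected number of variant-supporting reads at a locus."""
    return vaf * depth


def detection_probability(vaf: float, depth: int, min_reads: int) -> float:
    """P(at least min_reads variant reads) under a Binomial(depth, vaf) model."""
    p_below = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_below


# 0.1% circulating fraction at 5000x depth, as in the text above.
print("expected variant reads:", expected_alt_reads(0.001, 5000))
for static_threshold in (3, 5, 8):
    p = detection_probability(0.001, 5000, static_threshold)
    print(f"static threshold {static_threshold}: detection probability {p:.3f}")
```

Under these assumptions, a static filter requiring 8 supporting reads would detect such a variant only a minority of the time, which is the false-negative problem described above.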

Advantageously, the present disclosure provides methods and systems that more accurately call somatic variants by adjusting the variant count threshold in a locus-by-locus fashion, e.g., by lowering the variant count threshold when there is an increased likelihood (orthogonal to the variant count in the sequencing reaction) that a variant at a particular locus is a true somatic variant and/or by raising the variant count threshold when there is an increased likelihood (orthogonal to the variant count in the sequencing reaction) that a variant at a particular locus is a result of a sequencing error, rather than a true somatic variant.

For example, in some embodiments, the methods and systems described herein employ a generalized application of Bayes' theorem through the likelihood ratio test that allows dynamic calibration of filtering thresholds for diagnostic assays. These thresholds are based on one or more of a sample-specific error rate, a methodology-specific sequencing error rate (e.g., from a pool of process matched healthy control samples), an estimate of the variant allele fraction for the variant being evaluated, and a historical likelihood that a variant would be present at a particular locus in a particular cancer (e.g., derived from an extensive cohort of human solid tumor tissue samples to inform probability models). This results in high sensitivity and specificity in variant detection, allowing identification of actionable oncologic targets, as well as determination of a precise limit of detection to reduce the occurrence of false negatives.
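One common way to write such a likelihood-ratio application of Bayes' theorem is the odds form shown below. The notation is assumed for illustration and is not taken from the disclosure: V denotes the hypothesis that the candidate is a true somatic variant, E the hypothesis that it is a sequencing error, k the variant allele fragment count, n the locus fragment count, and \tau a hypothetical target post-test odds.

```latex
\[
\underbrace{\frac{P(V \mid k, n)}{P(E \mid k, n)}}_{\text{post-test odds}}
=
\underbrace{\frac{P(V)}{P(E)}}_{\text{pre-test odds}}
\times
\underbrace{\frac{P(k \mid n, V)}{P(k \mid n, E)}}_{\text{likelihood ratio}},
\qquad
k^{*} = \min\left\{ k : \frac{P(V \mid k, n)}{P(E \mid k, n)} \ge \tau \right\}
\]
```

In this reading, the pre-test odds carry the cohort-derived prevalence, the likelihood ratio carries the variant allele fraction estimate and the error rates, and k^{*} plays the role of the dynamic variant count threshold.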

For instance, in some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes' theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic region based on the prevalence of similar mutations within that genomic region in similar cancers. For instance, where there is a high prevalence of a somatic variant in a given gene for a particular cancer (e.g., BRCA1 mutations are common in breast cancers), the dynamic filtering method accounts for this prior (e.g., the prior knowledge that BRCA1 mutations are commonly found in breast cancers) by setting a lower variant count threshold to call somatic variants in the BRCA1 gene for a breast cancer. That is, the dynamic filtering methodology requires less evidence in order to call a variant in the BRCA1 gene when the subject has breast cancer than when the subject has a different cancer that is not associated with a high prevalence of BRCA1 mutations.
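The effect of the prevalence prior on the count threshold can be sketched numerically. The example below is a minimal sketch under stated assumptions, not the disclosed implementation: fragment counts are modeled as Poisson, the prevalence values, error rate, and target post-test odds of 99 are hypothetical, and the function names are invented for illustration.

```python
from math import lgamma, log


def log_poisson_pmf(count: int, mean: float) -> float:
    """Natural log of the Poisson probability of observing `count` events."""
    return count * log(mean) - mean - lgamma(count + 1)


def dynamic_count_threshold(prevalence, depth, vaf, error_rate,
                            target_post_odds=99.0, max_count=1000):
    """Smallest variant fragment count whose post-test odds reach the target.

    Pre-test odds come from the (hypothetical) cohort prevalence; the
    likelihood ratio compares a true variant at `vaf` with errors at
    `error_rate`. Returns None if no count up to max_count suffices.
    """
    log_pre_test_odds = log(prevalence / (1.0 - prevalence))
    for count in range(1, max_count + 1):
        log_lr = (log_poisson_pmf(count, depth * vaf)
                  - log_poisson_pmf(count, depth * error_rate))
        if log_pre_test_odds + log_lr >= log(target_post_odds):
            return count
    return None


# Hypothetical prevalences: a gene commonly mutated in the cancer type
# versus a gene rarely mutated in it.
common = dynamic_count_threshold(prevalence=0.30, depth=5000,
                                 vaf=0.001, error_rate=0.0001)
rare = dynamic_count_threshold(prevalence=0.001, depth=5000,
                               vaf=0.001, error_rate=0.0001)
print("threshold, commonly mutated region:", common)
print("threshold, rarely mutated region:", rare)
```

With these illustrative inputs, the commonly mutated region receives a lower count threshold than the rarely mutated one, matching the behavior described above.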

In some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes' theorem to dynamically tune a variant count threshold for calling a somatic variant based on an estimated variant allele fraction for the variant being evaluated. That is, the dynamic filtering methodology takes into account the fact that in a sample having a lower tumor fraction, and therefore a lower variant allele fraction, fewer sequences encompassing a somatic variant would be expected than in a sample having a higher tumor fraction, and therefore a higher variant allele fraction. Accordingly, the sensitivity and specificity of the dynamic filter are tuned to account for the expectation that a higher percentage of variant sequences with low sequence counts (e.g., lower support) represent true somatic variants in a sample with a low tumor fraction than in a sample with a high tumor fraction, for which a higher percentage of variant sequences with low sequence counts represent sequencing errors.

In some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes' theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic locus based on a historical sequencing error rate for the locus. That is, the dynamic filtering methodology takes into account the fact that at genomic loci that are more prone to sequencing errors, such as loci with short nucleotide repeat sequences (e.g., di-nucleotide or tri-nucleotide repeats), there is a higher likelihood that a particular variant is a product of a sequencing error, rather than a true somatic mutation, than at a locus that is not prone to sequencing errors.
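One way a historical, locus-specific error rate might be estimated is by pooling counts from healthy control samples at that locus. The Python sketch below is only an illustration under that assumption; the pseudocount smoothing, the function name `locus_error_rate`, and the example counts are hypothetical and are not described in the disclosure.

```python
from typing import Iterable, Tuple


def locus_error_rate(
    control_counts: Iterable[Tuple[int, int]],
    pseudo_variant: float = 0.5,
    pseudo_depth: float = 1.0,
) -> float:
    """Pooled error-rate estimate at one locus from healthy controls.

    Each item is (variant_fragment_count, locus_fragment_count) observed in
    one control sample. Variant-supporting fragments in healthy controls are
    treated as errors. The pseudocounts (an illustrative choice) keep the
    estimate above zero when no control shows an error.
    """
    variant_total = 0
    depth_total = 0
    for variant_count, depth in control_counts:
        variant_total += variant_count
        depth_total += depth
    return (variant_total + pseudo_variant) / (depth_total + pseudo_depth)


# Hypothetical control data: a repeat-rich locus versus a clean locus.
repeat_locus_rate = locus_error_rate([(3, 1000), (5, 1200), (4, 800)])
clean_locus_rate = locus_error_rate([(0, 1000), (1, 1200), (0, 800)])
```

A higher estimated error rate at the repeat-rich locus would, under the likelihood ratio framing, call for more supporting fragments before a variant is accepted there.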

Similarly, in some embodiments, the dynamic variant filtering methodology described herein uses an application of Bayes' theorem to dynamically tune a variant count threshold for calling a somatic variant at a particular genomic locus based on a reaction-specific sequencing error rate. That is, the dynamic filtering methodology takes into account the fact that in reactions with higher sequencing error rates there is a higher likelihood that a particular variant is a product of a sequencing error, rather than a true somatic mutation.

The present disclosure provides improved systems and methods for precision oncology based on improved variant calling in liquid biopsy data. The various improvements described herein, e.g., improved variant detection at low circulating fractions, are embodied in an example liquid biopsy workflow described in Examples 2 and 3. These examples describe an example liquid biopsy assay employing a 105-gene hybrid-capture next-generation sequencing (NGS) panel spanning 270 kb of the human genome, configured to detect targets in four variant classes, including single nucleotide variants (SNVs), insertions and/or deletions (indels), copy number variants (CNVs), and gene rearrangements. To establish robust clinical performance, extensive validation studies were conducted that demonstrated high sensitivity and specificity. Accordingly, the example liquid biopsy assay detected actionable variants with high accuracy in comparison to a commercial ctDNA NGS kit, commercial solid tumor biopsy-based assays, such as a solid tumor biopsy NGS tissue assay, and digital droplet PCR (ddPCR). As shown in the results of FIG. 17, the methods and systems disclosed herein reduced false positive variant calling by 11.45% compared to conventional variant detection methods.

The identification of actionable genomic alterations in a patient's cancer genome is a difficult and computationally demanding problem. For instance, the determination of various prognostic metrics useful for precision oncology, such as variant allelic ratio, copy number variation, tumor mutational burden, microsatellite instability status, etc., requires analysis of hundreds of millions to billions of sequenced nucleic acid bases. An example of a typical bioinformatics pipeline established for this purpose includes at least five stages of analysis: assessment of the quality of raw next generation sequencing data, generation of collapsed nucleic acid fragment sequences and alignment of such sequences to a reference genome, detection of structural variants in the aligned sequence data, annotation of identified variants, and visualization of the data. See, Wadapurkar and Vyas, Informatics in Medicine Unlocked, 11:75-82 (2018), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Each one of these procedures is computationally taxing in its own right.

For instance, the overall temporal and spatial computational complexities of simple global and local pairwise sequence alignment algorithms are quadratic in nature (e.g., second-order problems) and increase rapidly as a function of the sizes of the nucleic acid sequences (n and m) being compared. Specifically, the temporal and spatial complexities of these sequence alignment algorithms can be estimated as O(mn), where O is the upper bound on the asymptotic growth rate of the algorithm, n is the number of bases in the first nucleic acid sequence, and m is the number of bases in the second nucleic acid sequence. See, Baichoo and Ouzounis, BioSystems, 156-157:72-85 (2017), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Given that the human genome contains more than 3 billion bases, these alignment algorithms are extremely computationally taxing, especially when used to analyze next generation sequencing (NGS) data, which can generate more than 3 billion sequence reads per reaction.
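The quadratic cost can be seen in a textbook global alignment (Needleman-Wunsch) score computation, sketched below in Python. This is a generic illustration of why such algorithms are O(mn), not the aligner used by the disclosed workflow; the scoring values and the function name `global_alignment_score` are illustrative assumptions.

```python
def global_alignment_score(
    first: str,
    second: str,
    match: int = 1,
    mismatch: int = -1,
    gap: int = -2,
) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The table has (n + 1) * (m + 1) cells and each cell is filled in
    constant time, so both time and space grow as O(mn).
    """
    n, m = len(first), len(second)
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # Aligning a prefix against the empty sequence costs one gap per base.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = match if first[i - 1] == second[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align the two bases
                score[i - 1][j] + gap,               # gap in `second`
                score[i][j - 1] + gap,               # gap in `first`
            )
    return score[n][m]


identical = global_alignment_score("GATTACA", "GATTACA")
one_deletion = global_alignment_score("GATTACA", "GATACA")
```

Doubling the length of both sequences quadruples the number of table cells, which is why practical read aligners rely on indexing and heuristics rather than filling a full table against a whole genome.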

This is particularly true when performed in the context of a liquid biopsy assay, because liquid biopsy samples contain a complex mixture of short DNA fragments originating from many different germline (e.g., healthy) and diseased (e.g., cancerous) tissues. Thus, the cellular origins of the sequence reads are unknown, and the sequence signals originating from cancerous cells, which may constitute multiple sub-clonal populations, must be computationally deconvoluted from signals originating from germline and hematopoietic origins, in order to provide relevant information about the subject's cancer. Thus, in addition to the computationally taxing processes required to align sequence reads to a human genome, there is a computational problem of determining whether a particular abnormal signal, e.g., one or more sequence reads corresponding to a genomic alteration, (i) is not an artifact, and (ii) originated from a cancerous source in the subject. This is increasingly difficult during the early stages of cancer, when treatment is presumably most effective and when only small amounts of ctDNA are diluted by germline and hematopoietic DNA.

Advantageously, the present disclosure provides various systems and methods that improve the computational elucidation of actionable genomic alterations from a liquid biopsy sample of a cancer patient. Specifically, the present disclosure improves a method for identifying variants in ctDNA using a dynamic thresholding approach. As described above, the disclosed methods and systems are necessarily computer-implemented due to their complexity and heavy computational requirements, and thus solve a problem in the computing art.

Advantageously, the methods and systems described herein provide an improvement to the abovementioned technical problem (e.g., performing complex computer-implemented methods for identifying variants in ctDNA using a dynamic thresholding approach). The methods described herein therefore solve a problem in the computing art by improving upon conventional methods for identifying variants (e.g., actionable oncologic targets) for cancer diagnosis and treatment. For example, the application of Bayes' Theorem through the likelihood ratio test provides a means for improving detection of true positive variants and reducing detection of false positive variants for clinically relevant biomarkers, thus improving the accuracy and precision of genomic alteration detection in precision oncology.

The methods and systems described herein also improve precision oncology methods for assigning and/or administering treatment because of the improved accuracy of variant detection. The identification of therapeutically actionable variants that can be included in a clinical report for patient and/or clinician review, and/or matched with appropriate therapies and/or clinical trials for treatment and/or monitoring, allows for more accurate assignment of treatments. Furthermore, the removal of false positive variant detection reduces the risk of patients undergoing unnecessary or potentially harmful regimens due to misdiagnoses.

Definitions

As used herein, the term “subject” refers to any living or non-living organism including, but not limited to, a human (e.g., a male human, female human, fetus, pregnant female, child, or the like), a non-human mammal, or a non-human animal. Any human or non-human animal can serve as a subject, including but not limited to mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, dolphin, whale and shark. In some embodiments, a subject is a male or female of any age (e.g., a man, a woman, or a child).

As used herein, the terms “control,” “control sample,” “reference,” “reference sample,” “normal,” and “normal sample” describe a sample from a non-diseased tissue. In some embodiments, such a sample is from a subject that does not have a particular condition (e.g., cancer). In other embodiments, such a sample is an internal control from a subject, e.g., who may or may not have the particular disease (e.g., cancer), but is from a healthy tissue of the subject. For example, where a liquid or solid tumor sample is obtained from a subject with cancer, an internal control sample may be obtained from a healthy tissue of the subject, e.g., a white blood cell sample from a subject without a blood cancer or a solid germline tissue sample from the subject. Accordingly, a reference sample can be obtained from the subject or from a database, e.g., from a second subject who does not have the particular disease (e.g., cancer).

As used herein, the terms “cancer,” “cancerous tissue,” and “tumor” refer to an abnormal mass of tissue in which the growth of the mass surpasses, and is not coordinated with, the growth of normal tissue, including both solid masses (e.g., as in a solid tumor) and fluid masses (e.g., as in a hematological cancer). A cancer or tumor can be defined as “benign” or “malignant” depending on the following characteristics: degree of cellular differentiation including morphology and functionality, rate of growth, local invasion and metastasis. A “benign” tumor can be well differentiated, have characteristically slower growth than a malignant tumor and remain localized to the site of origin. In addition, in some cases a benign tumor does not have the capacity to infiltrate, invade or metastasize to distant sites. A “malignant” tumor can be poorly differentiated (anaplasia) and have characteristically rapid growth accompanied by progressive infiltration, invasion, and destruction of the surrounding tissue. Furthermore, a malignant tumor can have the capacity to metastasize to distant sites. Accordingly, a cancer cell is a cell found within the abnormal mass of tissue whose growth is not coordinated with the growth of normal tissue. Accordingly, a “tumor sample” refers to a biological sample obtained or derived from a tumor of a subject, as described herein.

Non-limiting examples of cancer types include ovarian cancer, cervical cancer, uveal melanoma, colorectal cancer, chromophobe renal cell carcinoma, liver cancer, endocrine tumor, oropharyngeal cancer, retinoblastoma, biliary cancer, adrenal cancer, neural cancer, neuroblastoma, basal cell carcinoma, brain cancer, breast cancer, non-clear cell renal cell carcinoma, glioblastoma, glioma, kidney cancer, gastrointestinal stromal tumor, medulloblastoma, bladder cancer, gastric cancer, bone cancer, non-small cell lung cancer, thymoma, prostate cancer, clear cell renal cell carcinoma, skin cancer, thyroid cancer, sarcoma, testicular cancer, head and neck cancer (e.g., head and neck squamous cell carcinoma), meningioma, peritoneal cancer, endometrial cancer, pancreatic cancer, mesothelioma, esophageal cancer, small cell lung cancer, Her2 negative breast cancer, ovarian serous carcinoma, HR+ breast cancer, uterine serous carcinoma, uterine corpus endometrial carcinoma, gastroesophageal junction adenocarcinoma, gallbladder cancer, chordoma, and papillary renal cell carcinoma.

As used herein, the terms “cancer state” or “cancer condition” refer to a characteristic of a cancer patient's condition, e.g., a diagnostic status, a type of cancer, a location of cancer, a primary origin of a cancer, a cancer stage, a cancer prognosis, and/or one or more additional characteristics of a cancer (e.g., tumor characteristics such as morphology, heterogeneity, size, etc.). In some embodiments, one or more additional personal characteristics of the subject are used to further describe the cancer state or cancer condition of the subject, e.g., age, gender, weight, race, personal habits (e.g., smoking, drinking, diet), other pertinent medical conditions (e.g., high blood pressure, dry skin, other diseases), current medications, allergies, pertinent medical history, current side effects of cancer treatments and other medications, etc.

As used herein, the term “liquid biopsy” sample refers to a liquid sample obtained from a subject that includes cell-free DNA. Examples of liquid biopsy samples include, but are not limited to, blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal material, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of the subject. In some embodiments, a liquid biopsy sample is a cell-free sample, e.g., a cell-free blood sample. In some embodiments, a liquid biopsy sample is obtained from a subject with cancer. In some embodiments, a liquid biopsy sample is collected from a subject with an unknown cancer status, e.g., for use in determining a cancer status of the subject. Likewise, in some embodiments, a liquid biopsy is collected from a subject with a non-cancerous disorder, e.g., a cardiovascular disease. In some embodiments, a liquid biopsy is collected from a subject with an unknown status for a non-cancerous disorder, e.g., for use in determining a non-cancerous disorder status of the subject.

As used herein, the terms “cell-free DNA” and “cfDNA” interchangeably refer to DNA fragments that circulate in a subject's body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancer cells. These DNA molecules are found outside cells, in bodily fluids such as blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal material, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of a subject, and are believed to be fragments of genomic DNA expelled from healthy and/or cancerous cells, e.g., upon apoptosis and lysis of the cellular envelope.

As used herein, the term “locus” refers to a position (e.g., a site) within a genome, e.g., on a particular chromosome. In some embodiments, a locus refers to a single nucleotide position, on a particular chromosome, within a genome. In some embodiments, a locus refers to a group of nucleotide positions within a genome. In some instances, a locus is defined by a mutation (e.g., substitution, insertion, deletion, inversion, or translocation) of consecutive nucleotides within a cancer genome. In some instances, a locus is defined by a gene, a sub-genic structure (e.g., a regulatory element, exon, intron, or combination thereof), or a predefined span of a chromosome. Because normal mammalian cells have diploid genomes, a normal mammalian genome (e.g., a human genome) will generally have two copies of every locus in the genome, or at least two copies of every locus located on the autosomal chromosomes, e.g., one copy on the maternal autosomal chromosome and one copy on the paternal autosomal chromosome.

As used herein, the term “allele” refers to a particular sequence of one or more nucleotides at a chromosomal locus. In a haploid organism, the subject has one allele at every chromosomal locus. In a diploid organism, the subject has two alleles at every chromosomal locus.

As used herein, the term “base pair” or “bp” refers to a unit consisting of two nucleobases bound to each other by hydrogen bonds. Generally, the size of an organism's genome is measured in base pairs because DNA is typically double stranded. However, some viruses have single-stranded DNA or RNA genomes.

As used herein, the terms “genomic alteration,” “mutation,” and “variant” refer to a detectable change in the genetic material of one or more cells. A genomic alteration, mutation, or variant can refer to various types of changes in the genetic material of a cell, including changes in the primary genome sequence at single or multiple nucleotide positions, e.g., a single nucleotide variant (SNV), a multi-nucleotide variant (MNV), an indel (e.g., an insertion or deletion of nucleotides), a DNA rearrangement (e.g., an inversion or translocation of a portion of a chromosome or chromosomes), a variation in the copy number of a locus (e.g., an exon, gene, or a large span of a chromosome) (CNV), a partial or complete change in the ploidy of the cell, as well as changes in the epigenetic information of a genome, such as altered DNA methylation patterns. In some embodiments, a mutation is a change in the genetic information of the cell relative to a particular reference genome, or one or more ‘normal’ alleles found in the population of the species of the subject. For instance, mutations can be found in both germline cells (e.g., non-cancerous, ‘normal’ cells) of a subject and in abnormal cells (e.g., pre-cancerous or cancerous cells) of the subject. As such, a mutation in a germline of the subject (e.g., which is found in substantially all ‘normal cells’ in the subject) is identified relative to a reference genome for the species of the subject. However, many loci of a reference genome of a species are associated with several variant alleles that are significantly represented in the population of the subject and are not associated with a diseased state, e.g., such that they would not be considered ‘mutations.’ By contrast, in some embodiments, a mutation in a cancerous cell of a subject can be identified relative to either a reference genome of the subject or to the subject's own germline genome. In certain instances, identification of both types of variants can be informative. For instance, in some instances, a mutation that is present in both the cancer genome of the subject and the germline of the subject is informative for precision oncology when the mutation is a so-called ‘driver mutation,’ which contributes to the initiation and/or development of a cancer. However, in other instances, a mutation that is present in both the cancer genome of the subject and the germline of the subject is not informative for precision oncology, e.g., when the mutation is a so-called ‘passenger mutation,’ which does not contribute to the initiation and/or development of the cancer. Likewise, in some instances, a mutation that is present in the cancer genome of the subject but not the germline of the subject is informative for precision oncology, e.g., where the mutation is a driver mutation and/or the mutation facilitates a therapeutic approach, e.g., by differentiating cancer cells from normal cells in a therapeutically actionable way. However, in some instances, a mutation that is present in the cancer genome but not the germline of a subject is not informative for precision oncology, e.g., where the mutation is a passenger mutation and/or where the mutation fails to differentiate the cancer cell from a germline cell in a therapeutically actionable way.

As used herein, the term “reference allele” refers to the sequence of one or more nucleotides at a chromosomal locus that is either the predominant allele represented at that chromosomal locus within the population of the species (e.g., the “wild-type” sequence), or an allele that is predefined within a reference genome for the species.

As used herein, the term “variant allele” refers to a sequence of one or more nucleotides at a chromosomal locus that is either not the predominant allele represented at that chromosomal locus within the population of the species (e.g., not the “wild-type” sequence), or not an allele that is predefined within a reference sequence construct (e.g., a reference genome or set of reference genomes) for the species. In some instances, sequence isoforms found within the population of a species that do not effect a change in a protein encoded by the genome, or that result in an amino acid substitution that does not substantially affect the function of an encoded protein, are not variant alleles.

As used herein, the term “variant allele fraction,” “VAF,” “allelic fraction,” or “AF” refers to the number of times a variant or mutant allele was observed (e.g., a number of reads supporting a candidate variant allele) divided by the total number of times the position was sequenced (e.g., a total number of reads covering a candidate locus).
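The definition above amounts to a simple quotient, sketched here in Python for concreteness. The function name `variant_allele_fraction` and the input validation are illustrative choices, not part of the disclosure.

```python
def variant_allele_fraction(variant_reads: int, total_reads: int) -> float:
    """Reads supporting the variant allele divided by reads covering the locus."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    return variant_reads / total_reads


# Hypothetical counts: 5 supporting reads out of 1,000 covering the locus.
example_vaf = variant_allele_fraction(5, 1000)
```

At a VAF of this magnitude, the handful of supporting reads is comparable to what sequencing error alone can produce, which is the situation the dynamic threshold is meant to address.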

As used herein, the terms “variant fragment count” and “variant allele fragment count” interchangeably refer to a quantification, e.g., a raw or normalized count, of the number of sequences representing unique cell-free DNA fragments encompassing a variant allele in a sequencing reaction. That is, a variant fragment count represents a count of sequence reads representing unique molecules in the liquid biopsy sample, after duplicate sequence reads in the raw sequencing data have been collapsed, e.g., through the use of unique molecular indices (UMI) and bagging, etc. as described herein.
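To make the distinction between raw reads and unique fragments concrete, the following Python sketch collapses duplicate reads into families and counts families rather than reads. It is a simplified stand-in for the UMI-based collapsing referred to above: the read tuple layout, the grouping key of UMI plus alignment coordinates, the majority-vote consensus, and the function name `count_unique_fragments` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical read record: (umi, fragment_start, fragment_end, allele_at_locus)
Read = Tuple[str, int, int, str]


def count_unique_fragments(reads: Iterable[Read], variant_allele: str) -> Tuple[int, int]:
    """Return (variant_fragment_count, locus_fragment_count).

    Reads sharing a UMI and alignment coordinates are treated as duplicates
    of one original molecule. Each family contributes one fragment, and its
    allele is decided by majority vote; on a tie, the allele seen first in
    the family wins.
    """
    families: Dict[Tuple[str, int, int], Counter] = defaultdict(Counter)
    for umi, start, end, allele in reads:
        families[(umi, start, end)][allele] += 1

    variant_fragments = 0
    for allele_counts in families.values():
        consensus_allele, _ = allele_counts.most_common(1)[0]
        if consensus_allele == variant_allele:
            variant_fragments += 1
    return variant_fragments, len(families)


# Six hypothetical reads that originate from three unique molecules.
example_reads = [
    ("AAC", 100, 260, "T"),
    ("AAC", 100, 260, "T"),
    ("AAC", 100, 260, "C"),  # likely a PCR or sequencing error in this family
    ("GGT", 90, 255, "C"),
    ("GGT", 90, 255, "C"),
    ("TTA", 120, 280, "T"),
]
variant_count, locus_count = count_unique_fragments(example_reads, "T")
```

Here four raw reads carry the variant allele, but only two unique fragments support it out of three covering the locus; the fragment-level counts are the ones compared against the threshold.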

As used herein, the term “germline variants” refers to genetic variants inherited from maternal and paternal DNA. Germline variants may be determined through a matched tumor-normal calling pipeline.

As used herein, the term “somatic variants” refers to variants arising as a result of dysregulated cellular processes associated with neoplastic cells, e.g., a mutation. Somatic variants may be detected via subtraction from a matched normal sample.

As used herein, the term “single nucleotide variant” or “SNV” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. A substitution from a first nucleobase X to a second nucleobase Y may be denoted as “X>Y.” For example, a cytosine to thymine SNV may be denoted as “C>T.”

As used herein, the term “insertions and deletions” or “indels” refers to a variant resulting from the gain or loss of DNA base pairs within an analyzed region.

As used herein, the term “copy number variation” or “CNV” refers to the process by which large structural changes in a genome associated with tumor aneuploidy and other dysregulated repair systems are detected. These processes are used to detect large scale insertions or deletions of entire genomic regions. CNV is defined as structural insertions or deletions greater than a certain size in base pairs (“bp”), such as 500 bp.

As used herein, the term “gene fusion” refers to the product of large-scale chromosomal aberrations resulting in the creation of a chimeric protein. These expressed products can be non-functional, or they can be highly over or underactive. This can cause deleterious effects in cancer such as hyper-proliferative or anti-apoptotic phenotypes.

As used herein, the term “loss of heterozygosity” refers to the loss of one copy of a segment (e.g., including part or all of one or more genes) of the genome of a diploid subject (e.g., a human) or loss of one copy of a sequence encoding a functional gene product in the genome of the diploid subject, in a tissue, e.g., a cancerous tissue, of the subject. As used herein, when referring to a metric representing loss of heterozygosity across the entire genome of the subject, loss of heterozygosity is caused by the loss of one copy of various segments in the genome of the subject. Loss of heterozygosity across the entire genome may be estimated without sequencing the entire genome of a subject, and such methods for such estimations based on gene panel targeting-based sequencing methodologies are described in the art. Accordingly, in some embodiments, a metric representing loss of heterozygosity across the entire genome of a tissue of a subject is represented as a single value, e.g., a percentage or fraction of the genome. In some cases, a tumor is composed of various sub-clonal populations, each of which may have a different degree of loss of heterozygosity across their respective genomes. Accordingly, in some embodiments, loss of heterozygosity across the entire genome of a cancerous tissue refers to an average loss of heterozygosity across a heterogeneous tumor population. As used herein, when referring to a metric for loss of heterozygosity in a particular gene, e.g., a DNA repair protein such as a protein involved in the homologous DNA recombination pathway (e.g., BRCA1 or BRCA2), loss of heterozygosity refers to complete or partial loss of one copy of the gene encoding the protein in the genome of the tissue and/or a mutation in one copy of the gene that prevents translation of a full-length gene product, e.g., a frameshift or truncating (creating a premature stop codon in the gene) mutation in the gene of interest. In some cases, a tumor is composed of various sub-clonal populations, each of which may have a different mutational status in a gene of interest. Accordingly, in some embodiments, loss of heterozygosity for a particular gene of interest is represented by an average value for loss of heterozygosity for the gene across all sequenced sub-clonal populations of the cancerous tissue. In other embodiments, loss of heterozygosity for a particular gene of interest is represented by a count of the number of unique incidences of loss of heterozygosity in the gene of interest across all sequenced sub-clonal populations of the cancerous tissue (e.g., the number of unique frame-shift and/or truncating mutations in the gene identified in the sequencing data).

As used herein, the term “microsatellites” refers to short, repeated sequences of DNA. The smallest nucleotide repeated unit of a microsatellite is referred to as the “repeated unit” or “repeat unit.” In some embodiments, the stability of a microsatellite locus is evaluated by comparing some metric of the distribution of the number of repeated units at a microsatellite locus to a reference number or distribution.
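As a small illustration of what “number of repeated units” can mean for a single sequence, the Python sketch below finds the longest run of back-to-back copies of a repeat unit. It is only one possible per-read measurement; the function name `longest_repeat_run` and the example sequences are hypothetical, and the disclosure does not specify how the repeat count is obtained.

```python
def longest_repeat_run(sequence: str, unit: str) -> int:
    """Largest number of consecutive, back-to-back copies of `unit` in `sequence`."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    best = 0
    step = len(unit)
    # Try every start offset so that runs in any frame are found.
    for start in range(len(sequence)):
        copies = 0
        position = start
        while sequence.startswith(unit, position):
            copies += 1
            position += step
        best = max(best, copies)
    return best


# Hypothetical reads over a CA dinucleotide microsatellite.
reference_like = longest_repeat_run("GGCACACACATT", "CA")
contracted = longest_repeat_run("GGCACATT", "CA")
```

Collecting this value across many reads at a locus would give the kind of distribution that can then be compared against a reference number or distribution.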

As used herein, the term “microsatellite instability” or “MSI” refers to a genetic hypermutability condition associated with various cancers that results from impaired DNA mismatch repair (MMR) in a subject. Among other phenotypes, MSI causes changes in the size of microsatellite loci, e.g., a change in the number of repeated units at microsatellite loci, during DNA replication. Accordingly, the size of microsatellite repeats is varied in MSI cancers as compared to the size of the corresponding microsatellite repeats in the germline of a cancer subject. The term “Microsatellite Instability-High” or “MSI-H” refers to a state of a cancer (e.g., a tumor) that has a significant MMR defect, resulting in microsatellite loci with significantly different lengths than the corresponding microsatellite loci in normal cells of the same individual. The term “Microsatellite Stable” or “MSS” refers to a state of a cancer (e.g., a tumor) without significant MMR defects, such that there is no significant difference between the lengths of the microsatellite loci in cancerous cells and the lengths of the corresponding microsatellite loci in normal (e.g., non-cancerous) cells in the same individual. The term “Microsatellite Equivocal” or “MSE” refers to a state of a cancer (e.g., a tumor) having an intermediate microsatellite length phenotype that cannot be clearly classified as MSI-H or MSS based on statistical cutoffs used to define those two categories.

As used herein, the term “gene product” refers to an RNA (e.g., mRNA or miRNA) or protein molecule transcribed or translated from a particular genomic locus, e.g., a particular gene. The genomic locus can be identified using a gene name, a chromosomal location, or any other genetic mapping metric.

As used herein, the term “ratio” refers to any comparison of a first metric X, or a first mathematical transformation thereof X′ (e.g., measurement of a number of units of a genomic sequence in a first one or more biological samples or a first mathematical transformation thereof) to another metric Y or a second mathematical transformation thereof Y′ (e.g., the number of units of a respective genomic sequence in a second one or more biological samples or a second mathematical transformation thereof) expressed as X/Y, Y/X, log_(N)(X/Y), log_(N)(Y/X), X′/Y, Y/X′, log_(N)(X′/Y), log_(N)(Y/X′), X/Y′, Y′/X, log_(N)(X/Y′), log_(N)(Y′/X), X′/Y′, Y′/X′, log_(N)(X′/Y′), or log_(N)(Y′/X′), where N is any real number greater than 1 and where example mathematical transformations of X and Y include, but are not limited to, raising X or Y to a power Z, multiplying X or Y by a constant Q, where Z and Q are any real numbers, and/or taking an M based logarithm of X and/or Y, where M is a real number greater than 1. In one non-limiting example, X is transformed to X′ prior to ratio calculation by raising X to the power of two (X²) and Y is transformed to Y′ prior to ratio calculation by raising Y to the power of 3.2 (Y^(3.2)) and the ratio of X and Y is computed as log₂(X′/Y′).
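The non-limiting example at the end of this definition can be worked through numerically. The Python sketch below computes log₂(X′/Y′) with X′ = X² and Y′ = Y^(3.2); the function name `transformed_log_ratio`, its default arguments, and the input values are illustrative assumptions.

```python
import math


def transformed_log_ratio(
    x: float,
    y: float,
    x_power: float = 2.0,
    y_power: float = 3.2,
    base: float = 2.0,
) -> float:
    """Compute log_base(X' / Y') where X' = x ** x_power and Y' = y ** y_power."""
    if x <= 0 or y <= 0:
        raise ValueError("x and y must be positive")
    if base <= 1:
        raise ValueError("base must be greater than 1")
    x_transformed = x ** x_power
    y_transformed = y ** y_power
    return math.log(x_transformed / y_transformed, base)


# Hypothetical metrics X = 8 and Y = 2:
# log2(8**2 / 2**3.2) = 2 * log2(8) - 3.2 * log2(2) = 6 - 3.2 = 2.8
example_ratio = transformed_log_ratio(8.0, 2.0)
```

Because the transformations are powers, the logarithm of the ratio reduces to a weighted difference of the logarithms of X and Y, which is what the comment in the example spells out.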

As used herein, the terms “expression level,” “abundance level,” or simply “abundance” refer to an amount of a gene product (an RNA species, e.g., mRNA or miRNA, or protein molecule) transcribed or translated by a cell, or an average amount of a gene product transcribed or translated across multiple cells. When referring to mRNA or protein expression, the term generally refers to the amount of any RNA or protein species corresponding to a particular genomic locus, e.g., a particular gene. However, in some embodiments, an expression level can refer to the amount of a particular isoform of an mRNA or protein corresponding to a particular gene that gives rise to multiple mRNA or protein isoforms. The genomic locus can be identified using a gene name, a chromosomal location, or any other genetic mapping metric.

As used herein, the term “relative abundance” refers to a ratio of a first amount of a compound measured in a sample, e.g., a gene product (an RNA species, e.g., mRNA or miRNA, or protein molecule) or nucleic acid fragments having a particular characteristic (e.g., aligning to a particular locus or encompassing a particular allele), to a second amount of a compound measured in a second sample. In some embodiments, relative abundance refers to a ratio of an amount of species of a compound to a total amount of the compound in the same sample. For instance, a ratio of the amount of mRNA transcripts encoding a particular gene in a sample (e.g., aligning to a particular region of the exome) to the total amount of mRNA transcripts in the sample. In other embodiments, relative abundance refers to a ratio of an amount of a compound or species of a compound in a first sample to an amount of the compound or the species of the compound in a second sample. For instance, a ratio of a normalized amount of mRNA transcripts encoding a particular gene in a first sample to a normalized amount of mRNA transcripts encoding the particular gene in a second and/or reference sample.

As used herein, the terms “sequencing,” “sequence determination,” and the like refer to any biochemical processes that may be used to determine the order of biological macromolecules such as nucleic acids or proteins. For example, sequencing data can include all or a portion of the nucleotide bases in a nucleic acid molecule such as an mRNA transcript or a genomic locus.

As used herein, the term “genetic sequence” refers to a recordation of aseries of nucleotides present in a subject's RNA or DNA as determined bysequencing of nucleic acids from the subject.

As used herein, the term “sequence reads” or “reads” refers to nucleotide sequences produced by any nucleic acid sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”) or from both ends of nucleic acid fragments (e.g., paired-end reads, double-end reads). The length of the sequence read is often associated with the particular sequencing technology. High-throughput methods, for example, provide sequence reads that can vary in size from tens to hundreds of base pairs (bp). In some embodiments, the sequence reads are of a mean, median or average length of about 15 bp to 900 bp long (e.g., about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp). In some embodiments, the sequence reads are of a mean, median or average length of about 1000 bp, 2000 bp, 5000 bp, 10,000 bp, or 50,000 bp or more. Nanopore® sequencing, for example, can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs. Illumina® parallel sequencing, for example, can provide sequence reads that do not vary as much, for example, most of the sequence reads can be smaller than 200 bp. A sequence read (or sequencing read) can refer to sequence information corresponding to a nucleic acid molecule (e.g., a string of nucleotides). For example, a sequence read can correspond to a string of nucleotides (e.g., about 20 to about 150) from part of a nucleic acid fragment, can correspond to a string of nucleotides at one or both ends of a nucleic acid fragment, or can correspond to nucleotides of the entire nucleic acid fragment. A sequence read can be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or capture probes, or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.

As used herein, the term “read segment” refers to any form of nucleotide sequence read including the raw sequence reads obtained directly from a nucleic acid sequencing technique or from a sequence derived therefrom, e.g., an aligned sequence read, a collapsed sequence read, or a stitched sequence read.

As used herein, the term “read count” refers to the total number of nucleic acid reads generated, which may or may not be equivalent to the number of nucleic acid molecules generated, during a nucleic acid sequencing reaction.

As used herein, the term “read-depth,” “sequencing depth,” or “depth” can refer to a total number of unique nucleic acid fragments encompassing a particular locus or region of the genome of a subject that are sequenced in a particular sequencing reaction. Sequencing depth can be expressed as “Y×”, e.g., 50×, 100×, etc., where “Y” refers to the number of unique nucleic acid fragments encompassing a particular locus that are sequenced in a sequencing reaction. In such a case, Y is necessarily an integer, because it represents the actual sequencing depth for a particular locus. Alternatively, read-depth, sequencing depth, or depth can refer to a measure of central tendency (e.g., a mean or mode) of the number of unique nucleic acid fragments that encompass one of a plurality of loci or regions of the genome of a subject that are sequenced in a particular sequencing reaction. For example, in some embodiments, sequencing depth refers to the average depth of every locus across an arm of a chromosome, a targeted sequencing panel, an exome, or an entire genome. In such a case, Y may be expressed as a fraction or a decimal, because it refers to an average coverage across a plurality of loci. When a mean depth is recited, the actual depth for any particular locus may be different from the overall recited depth. Metrics can be determined that provide a range of sequencing depths in which a defined percentage of the total number of loci fall. For instance, a range of sequencing depths within which 90%, 95%, or 99% of the loci fall. As understood by the skilled artisan, different sequencing technologies provide different sequencing depths. For instance, low-pass whole genome sequencing can refer to technologies that provide a sequencing depth of less than 5×, less than 4×, less than 3×, or less than 2×, e.g., from about 0.5× to about 3×.
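As a non-limiting sketch of the aggregate sense of depth described above, the following computes a mean depth across a set of loci together with a central range of depths containing at least a stated percentage of the loci. The function name, the symmetric trimming rule, and the example depths are assumptions made for illustration; other range definitions are equally consistent with the description.

```python
from statistics import mean


def depth_summary(per_locus_depths, percent=90):
    """Return (mean depth, (low, high)) where [low, high] is a central range
    of depths that contains at least `percent` percent of the loci."""
    ordered = sorted(per_locus_depths)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one locus is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    keep = -(-(percent * n) // 100)   # ceiling of percent * n / 100, in integers
    low_index = (n - keep) // 2       # trim the remaining loci from both tails
    high_index = low_index + keep - 1
    return mean(ordered), (ordered[low_index], ordered[high_index])


# Hypothetical per-locus unique fragment counts for a 20-locus panel.
example_depths = list(range(1, 21))
mean_depth, central_range = depth_summary(example_depths, percent=90)
```

Here each per-locus depth is an integer, while the mean across loci may be fractional, matching the distinction drawn above.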

As used herein, the term “sequencing breadth” refers to what fraction of a particular reference exome (e.g., human reference exome), a particular reference genome (e.g., human reference genome), or part of the exome or genome has been analyzed. Sequencing breadth can be expressed as a fraction, a decimal, or a percentage, and is generally calculated as (the number of loci analyzed/the total number of loci in a reference exome or reference genome). The denominator of the fraction can be a repeat-masked genome, and thus 100% can correspond to all of the reference genome minus the masked parts. A repeat-masked exome or genome can refer to an exome or genome in which sequence repeats are masked (e.g., sequence reads align to unmasked portions of the exome or genome). In some embodiments, any part of an exome or genome can be masked and, thus, sequencing breadth can be evaluated for any desired portion of a reference exome or genome. In some embodiments, “broad sequencing” refers to sequencing/analysis of at least 0.1% of an exome or genome.
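The breadth calculation above, including the optional masked denominator, can be illustrated with the following minimal sketch. The function name, parameter names, and locus counts are hypothetical, and the sketch assumes that the analyzed loci all lie in the unmasked portion of the reference.

```python
def sequencing_breadth(loci_analyzed: int, total_loci: int, masked_loci: int = 0) -> float:
    """Fraction of the (optionally masked) reference that was analyzed."""
    denominator = total_loci - masked_loci   # 100% corresponds to the unmasked reference
    if denominator <= 0:
        raise ValueError("the unmasked reference must contain at least one locus")
    if not 0 <= loci_analyzed <= denominator:
        raise ValueError("loci_analyzed must lie within the unmasked reference")
    return loci_analyzed / denominator


# Hypothetical example: 500 of 1000 loci analyzed, with and without 200 masked loci.
unmasked_breadth = sequencing_breadth(500, 1000)
masked_breadth = sequencing_breadth(500, 1000, masked_loci=200)
```

The result can be reported as a fraction, a decimal, or, after multiplying by 100, a percentage.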

As used herein, the term “sequencing probe” refers to a molecule that binds to a nucleic acid with affinity that is based on the expected nucleotide sequence of the RNA or DNA present at that locus.

As used herein, the term “targeted panel” or “targeted gene panel” refers to a combination of probes for sequencing (e.g., by next-generation sequencing) nucleic acids present in a biological sample from a subject (e.g., a tumor sample, liquid biopsy sample, germline tissue sample, white blood cell sample, or tumor or tissue organoid sample), selected to map to one or more loci of interest on one or more chromosomes. An example set of loci/genes useful for precision oncology, e.g., via solid or liquid biopsy assay, that can be analyzed using a targeted panel is described in Table 1. In some embodiments, in addition to loci that are informative for precision oncology, a targeted panel includes one or more probes for sequencing one or more of a locus associated with a different medical condition, a locus used for internal control purposes, or a locus from a pathogenic organism (e.g., an oncogenic pathogen).

As used herein, the term “reference exome” refers to any sequenced or otherwise characterized exome, whether partial or complete, of any tissue from any organism or pathogen that may be used to reference identified sequences from a subject. Typically, a reference exome will be derived from a subject of the same species as the subject whose sequences are being evaluated. Example reference exomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”). An “exome” refers to the complete transcriptional profile of an organism or pathogen, expressed in nucleic acid sequences. As used herein, a reference sequence or reference exome often is an assembled or partially assembled exomic sequence from an individual or multiple individuals. In some embodiments, a reference exome is an assembled or partially assembled exomic sequence from one or more human individuals. The reference exome can be viewed as a representative example of a species' set of expressed genes. In some embodiments, a reference exome comprises sequences assigned to chromosomes.

As used herein, the term “reference genome” refers to any sequenced or otherwise characterized genome, whether partial or complete, of any organism or pathogen that may be used to reference identified sequences from a subject. Typically, a reference genome will be derived from a subject of the same species as the subject whose sequences are being evaluated. Exemplary reference genomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”) or the University of California, Santa Cruz (UCSC). A “genome” refers to the complete genetic information of an organism or pathogen, expressed in nucleic acid sequences. As used herein, a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals. In some embodiments, a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals. The reference genome can be viewed as a representative example of a species' set of genes. In some embodiments, a reference genome comprises sequences assigned to chromosomes. Exemplary human reference genomes include but are not limited to NCBI build 34 (UCSC equivalent: hg16), NCBI build 35 (UCSC equivalent: hg17), NCBI build 36.1 (UCSC equivalent: hg18), GRCh37 (UCSC equivalent: hg19), and GRCh38 (UCSC equivalent: hg38). For a haploid genome, there can be only one nucleotide at each locus. For a diploid genome, heterozygous loci can be identified; each heterozygous locus can have two alleles, where either allele can allow a match for alignment to the locus.

As used herein, the term “bioinformatics pipeline” refers to a series of processing stages used to determine characteristics of a subject's genome or exome based on sequencing data of the subject's genome or exome. A bioinformatics pipeline may be used to determine characteristics of a germline genome or exome of a subject and/or a cancer genome or exome of a subject. In some embodiments, the pipeline extracts information related to genomic alterations in the cancer genome of a subject, which is useful for guiding clinical decisions for precision oncology, from sequencing results of a biological sample, e.g., a tumor sample, liquid biopsy sample, reference normal sample, etc., from the subject. Certain processing stages in a bioinformatics pipeline may be ‘connected,’ meaning that the results of a first respective processing stage are informative and/or essential for execution of a second, downstream processing stage. For instance, in some embodiments, a bioinformatics pipeline includes a first respective processing stage for identifying genomic alterations that are unique to the cancer genome of a subject and a second respective processing stage that uses the quantity and/or identity of the identified genomic alterations to determine a metric that is informative for precision oncology, e.g., a tumor mutational burden. In some embodiments, the bioinformatics pipeline includes a reporting stage that generates a report of relevant and/or actionable information identified by upstream stages of the pipeline, which may or may not further include recommendations for aiding clinical therapy decisions.

As used herein, the term “limit of detection” or “LOD” refers to the minimal quantity of a feature that can be identified with a particular level of confidence. Accordingly, limit of detection can be used to describe an amount of a substance that must be present in order for a particular assay to reliably detect the substance. A limit of detection can also be used to describe a level of support needed for an algorithm to reliably identify a genomic alteration based on sequencing data. For example, a minimal number of unique sequence reads to support identification of a sequence variant such as a SNV.

As used herein, the term “BAM File” or “Binary file containing Alignment Maps” refers to a file storing sequencing data aligned to a reference sequence (e.g., a reference genome or exome). In some embodiments, a BAM file is a compressed binary version of a SAM (Sequence Alignment Map) file that includes, for each of a plurality of unique sequence reads, an identifier for the sequence read, information about the nucleotide sequence, information about the alignment of the sequence to a reference sequence, and optionally metrics relating to the quality of the sequence read and/or the quality of the sequence alignment. While BAM files generally relate to files having a particular format, for simplicity they are used herein to simply refer to a file, of any format, containing information about a sequence alignment, unless specifically stated otherwise.
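For illustration, the per-read information described above can be seen in the text (SAM) form of an alignment record, which has eleven mandatory tab-separated fields. The following sketch parses one such line using only the standard library; it does not read compressed binary BAM files, and the class name, field names, and example record are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlignmentRecord:
    read_id: str            # SAM field 1 (QNAME): identifier for the sequence read
    flag: int               # SAM field 2 (FLAG): bitwise alignment flags
    reference_name: str     # SAM field 3 (RNAME): reference sequence aligned to
    position: int           # SAM field 4 (POS): 1-based leftmost aligned position
    mapping_quality: int    # SAM field 5 (MAPQ): quality of the alignment
    cigar: str              # SAM field 6 (CIGAR): how the read aligns
    sequence: str           # SAM field 10 (SEQ): nucleotide sequence of the read
    base_qualities: str     # SAM field 11 (QUAL): per-base quality string


def parse_sam_line(line: str) -> AlignmentRecord:
    """Parse one non-header SAM alignment line into an AlignmentRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 tab-separated fields")
    return AlignmentRecord(
        read_id=fields[0],
        flag=int(fields[1]),
        reference_name=fields[2],
        position=int(fields[3]),
        mapping_quality=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
        base_qualities=fields[10],
    )


# Hypothetical record, for illustration only.
example_line = "read1\t0\tchr7\t55191822\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n"
example_record = parse_sam_line(example_line)
```

In practice a dedicated alignment library would typically be used to read BAM files; the sketch only shows which pieces of information accompany each aligned read.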

As used herein, the term “measure of central tendency” refers to a central or representative value for a distribution of values. Non-limiting examples of measures of central tendency include an arithmetic mean, weighted mean, midrange, midhinge, trimean, geometric mean, geometric median, Winsorized mean, median, and mode of the distribution of values.

As used herein, the term “Positive Predictive Value” or “PPV” means the likelihood that a variant is properly called given that a variant has been called by an assay. PPV can be expressed as (number of true positives)/(number of false positives+number of true positives).

As used herein, the term “assay” refers to a technique for determining a property of a substance, e.g., a nucleic acid, a protein, a cell, a tissue, or an organ. An assay (e.g., a first assay or a second assay) can comprise a technique for determining the copy number variation of nucleic acids in a sample, the methylation status of nucleic acids in a sample, the fragment size distribution of nucleic acids in a sample, the mutational status of nucleic acids in a sample, or the fragmentation pattern of nucleic acids in a sample. Any assay known to a person having ordinary skill in the art can be used to detect any of the properties of nucleic acids mentioned herein. Properties of a nucleic acid can include a sequence, genomic identity, copy number, methylation state at one or more nucleotide positions, size of the nucleic acid, presence or absence of a mutation in the nucleic acid at one or more nucleotide positions, and pattern of fragmentation of a nucleic acid (e.g., the nucleotide position(s) at which a nucleic acid fragments). An assay or method can have a particular sensitivity and/or specificity, and its relative usefulness as a diagnostic tool can be measured using ROC-AUC statistics.

As used herein, the term “classification” can refer to any number(s) or other character(s) that are associated with a particular property of a sample. For example, in some embodiments, the term “classification” can refer to a type of cancer in a subject, a stage of cancer in a subject, a prognosis for a cancer in a subject, a tumor load, a presence of tumor metastasis in a subject, and the like. The classification can be binary (e.g., positive or negative) or have more levels of classification (e.g., a scale from 1 to 10 or 0 to 1). The terms “cutoff” and “threshold” can refer to predetermined numbers used in an operation. For example, a cutoff size can refer to a size above which fragments are excluded. A threshold value can be a value above or below which a particular classification applies. Either of these terms can be used in either of these contexts.
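The two uses of a predetermined number described above, a size cutoff and a classification threshold, can be sketched as follows. The function names, the choice to treat a value equal to the threshold as positive, and the example values are assumptions made for illustration only.

```python
def apply_size_cutoff(fragment_sizes, cutoff_size):
    """Exclude fragments whose size is above the cutoff."""
    return [size for size in fragment_sizes if size <= cutoff_size]


def classify(value, threshold):
    """Binary classification; a value equal to the threshold is treated as
    positive here, which is an illustrative assumption."""
    return "positive" if value >= threshold else "negative"


# Hypothetical fragment sizes (bp) and a hypothetical cutoff of 200 bp.
retained_sizes = apply_size_cutoff([120, 167, 310, 200, 450], cutoff_size=200)
# Hypothetical supporting count compared against a hypothetical threshold of 5.
example_call = classify(value=7, threshold=5)
```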

As used herein, the term “sensitivity” or “true positive rate” (TPR) refers to the number of true positives divided by the sum of the number of true positives and false negatives. Sensitivity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly has a condition. For example, sensitivity can characterize the ability of a method to correctly identify the number of subjects within a population having cancer. In another example, sensitivity can characterize the ability of a method to correctly identify the one or more markers indicative of cancer.

As used herein, the term “specificity” or “true negative rate” (TNR) refers to the number of true negatives divided by the sum of the number of true negatives and false positives. Specificity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly does not have a condition. For example, specificity can characterize the ability of a method to correctly identify the number of subjects within a population not having cancer. In another example, specificity characterizes the ability of a method to correctly identify one or more markers indicative of cancer.
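The definitions of positive predictive value, sensitivity, and specificity given herein reduce to simple ratios of confusion-matrix counts, as the following sketch shows. The function names and the example counts are hypothetical.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)


def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV: TP / (FP + TP)."""
    return true_positives / (false_positives + true_positives)


# Hypothetical counts from comparing assay calls against a truth set.
tp, fn, tn, fp = 90, 10, 950, 50
example_sensitivity = sensitivity(tp, fn)            # 90 / 100
example_specificity = specificity(tn, fp)            # 950 / 1000
example_ppv = positive_predictive_value(tp, fp)      # 90 / 140
```

Note that the same assay can show high sensitivity and specificity yet a modest PPV when true positives are rare relative to false positives, as in the hypothetical counts above.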

As used herein, an “actionable genomic alteration” or “actionable variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to be associated with a therapeutic course of action that is more likely to produce a positive effect in a cancer patient that has the actionable variant than in a similarly situated cancer patient that does not have the actionable variant. For instance, administration of EGFR inhibitors (e.g., afatinib, erlotinib, gefitinib) is more effective for treating non-small cell lung cancer in patients with an EGFR mutation in exons 19/21 than for treating non-small cell lung cancer in patients that do not have an EGFR mutation in exons 19/21. Accordingly, an EGFR mutation in exons 19/21 is an actionable variant. In some instances, an actionable variant is only associated with an improved treatment outcome in one or a group of specific cancer types. In other instances, an actionable variant is associated with an improved treatment outcome in substantially all cancer types.

As used herein, a “variant of uncertain significance” or “VUS” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), whose impact on disease development/progression is unknown.

As used herein, a “benign variant” or “likely benign variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to not contribute to disease development/progression.

As used herein, a “pathogenic variant” or “likely pathogenic variant” refers to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or ploidy variation), or value of another cancer metric derived from nucleic acid sequencing data (e.g., a tumor mutational burden, MSI status, or tumor fraction), that is known or believed to contribute to disease development/progression.

As used herein, an “effective amount” or “therapeutically effective amount” is an amount sufficient to effect a beneficial or desired clinical result upon treatment. An effective amount can be administered to a subject in one or more doses. In terms of treatment, an effective amount is an amount that is sufficient to palliate, ameliorate, stabilize, reverse or slow the progression of the disease, or otherwise reduce the pathological consequences of the disease. The effective amount is generally determined by the physician on a case-by-case basis and is within the skill of one in the art. Several factors are typically taken into account when determining an appropriate dosage to achieve an effective amount. These factors include age, sex and weight of the subject, the condition being treated, the severity of the condition and the form and effective concentration of the therapeutic agent being administered.

The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, to the extent that the terms “including,” “includes,” “having,” “has,” “with,” or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”

As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event],” depending on the context.

It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first subject could be termed a second subject, and, similarly, a second subject could be termed a first subject, without departing from the scope of the present disclosure. The first subject and the second subject are both subjects, but they are not the same subject. Furthermore, the terms “subject,” “user,” and “patient” are used interchangeably herein.

Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, including example systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative implementations. However, the illustrative discussions below are not intended to be exhaustive or to limit the implementations to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The features described herein are not limited by the illustrated ordering of acts or events, as some acts can occur in different orders and/or concurrently with other acts or events.

The implementations provided herein are chosen and described in order to best explain the principles and their practical applications, to thereby enable others skilled in the art to best utilize the various embodiments with various modifications as are suited to the particular use contemplated. In some instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. In other instances, it will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without one or more of the specific details.

It will be appreciated that, in the development of any such actual implementation, numerous implementation-specific decisions are made in order to achieve the designer's specific goals, such as compliance with use case- and business-related constraints, and that these specific goals will vary from one implementation to another and from one designer to another. Moreover, it will be appreciated that though such a design effort might be complex and time-consuming, it will nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of the present disclosure.

Example System Embodiments

Now that an overview of some aspects of the present disclosure and some definitions used in the present disclosure have been provided, details of an exemplary system for providing clinical support for personalized cancer therapy using a liquid biopsy assay are now described in conjunction with FIGS. 1A-1D. FIGS. 1A-1D collectively illustrate the topology of an example system for providing clinical support for personalized cancer therapy using a liquid biopsy assay, in accordance with some embodiments of the present disclosure. Advantageously, the example system illustrated in FIGS. 1A-1D improves upon conventional methods for providing clinical support for personalized cancer therapy by validating a somatic sequence variant in a test subject having a cancer condition.

FIG. 1A is a block diagram illustrating a system in accordance with some implementations. The device 100 in some implementations includes one or more processing units CPU(s) 102 (also referred to as processors), one or more network interfaces 104, a user interface 106, e.g., including a display 108 and/or an input 110 (e.g., a mouse, touchpad, keyboard, etc.), a non-persistent memory 111, a persistent memory 112, and one or more communication buses 114 for interconnecting these components. The one or more communication buses 114 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. The non-persistent memory 111 typically includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, ROM, EEPROM, or flash memory, whereas the persistent memory 112 typically includes CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. The persistent memory 112 optionally includes one or more storage devices remotely located from the CPU(s) 102. The persistent memory 112, and the non-volatile memory device(s) within the non-persistent memory 111, comprise a non-transitory computer readable storage medium. In some implementations, the non-persistent memory 111 or alternatively the non-transitory computer readable storage medium stores the following programs, modules and data structures, or a subset thereof, sometimes in conjunction with the persistent memory 112:

-   an operating system 116, which includes procedures for handling various basic system services and for performing hardware dependent tasks;
-   a network communication module (or instructions) 118 for connecting the system 100 with other devices and/or a communication network 105;
-   a test patient data store 120 for storing one or more collections of features from patients (e.g., subjects);
-   a bioinformatics module 140 for processing sequencing data and extracting features from sequencing data, e.g., from liquid biopsy sequencing assays;
-   a feature analysis module 160 for evaluating patient features, e.g., genomic alterations, compound genomic features, and clinical features; and
-   a reporting module 180 for generating and transmitting reports that provide clinical support for personalized cancer therapy.

Although FIGS. 1A-1D depict a “system 100,” the figures are intended more as a functional description of the various features that may be present in computer systems than as a structural schematic of the implementations described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated. Moreover, although FIG. 1A depicts certain data and modules in non-persistent memory 111, some or all of these data and modules may be in persistent memory 112. For example, in various implementations, one or more of the above identified elements are stored in one or more of the previously mentioned memory devices and correspond to a set of instructions for performing a function described above. The above identified modules, data, or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures, datasets, or modules, and thus various subsets of these modules and data may be combined or otherwise re-arranged in various implementations.

In some implementations, the non-persistent memory 111 optionally stores a subset of the modules and data structures identified above. Furthermore, in some embodiments, the memory stores additional modules and data structures not described above. In some embodiments, one or more of the above-identified elements is stored in a computer system, other than that of system 100, that is addressable by system 100 so that system 100 may retrieve all or a portion of such data when needed.

For purposes of illustration in FIG. 1A, system 100 is represented as a single computer that includes all of the functionality for providing clinical support for personalized cancer therapy. However, while a single machine is illustrated, the term “system” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

For example, in some embodiments, system 100 includes one or more computers. In some embodiments, the functionality for providing clinical support for personalized cancer therapy is spread across any number of networked computers and/or resides on each of several networked computers and/or is hosted on one or more virtual machines at a remote location accessible across the communications network 105. For example, different portions of the various modules and data stores illustrated in FIGS. 1A-1D can be stored and/or executed on the various instances of a processing device and/or processing server/database in the distributed diagnostic environment 210 illustrated in FIG. 2B (e.g., processing devices 224, 234, 244, and 254, processing server 262, and database 264).

The system may operate in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment. The system may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.

In another implementation, the system comprises a virtual machine that includes a module for executing instructions for performing any one or more of the methodologies disclosed herein. In computing, a virtual machine (VM) is an emulation of a computer system that is based on computer architectures and provides functionality of a physical computer. Some such implementations may involve specialized hardware, software, or a combination of hardware and software.

One of skill in the art will appreciate that any of a wide array of different computer topologies can be used for the application, and all such topologies are within the scope of the present disclosure.

Test Patient Data Store (120)

Referring to FIG. 1B, in some embodiments, the system (e.g., system 100) includes a patient data store 120 that stores data for patients 121-1 to 121-M (e.g., cancer patients or patients being tested for cancer) including one or more of sequencing data 122, feature data 125, and clinical assessments 139. These data are used and/or generated by the various processes stored in the bioinformatics module 140 and feature analysis module 160 of system 100, to ultimately generate a report providing clinical support for personalized cancer therapy of a patient. While the feature scope of patient data 121 across all patients may be informationally dense, an individual patient's feature set may be sparsely populated across the entirety of the collective feature scope of all features across all patients. That is to say, the data stored for one patient may include a different set of features than the data stored for another patient. Further, while illustrated as a single data construct in FIG. 1B, different sets of patient data may be stored in different databases or modules spread across one or more system memories.
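One way to picture the sparsely populated, per-patient feature sets described above is the following in-memory sketch. The class names, attribute names, and example features are hypothetical and do not describe the actual storage layout of patient data store 120; the sketch assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PatientRecord:
    """Illustrative stand-in for one patient's data (e.g., patient 121-1)."""
    patient_id: str
    sequencing_data: list[str] = field(default_factory=list)        # cf. sequencing data 122
    feature_data: dict[str, Any] = field(default_factory=dict)      # cf. feature data 125
    clinical_assessments: list[str] = field(default_factory=list)   # cf. clinical assessments 139


class PatientDataStore:
    """Illustrative stand-in for a patient data store keyed by patient identifier."""

    def __init__(self) -> None:
        self._patients: dict[str, PatientRecord] = {}

    def add(self, record: PatientRecord) -> None:
        self._patients[record.patient_id] = record

    def feature_scope(self) -> set[str]:
        """Collective set of feature names across all patients."""
        scope: set[str] = set()
        for record in self._patients.values():
            scope.update(record.feature_data)
        return scope

    def get_feature(self, patient_id: str, name: str, default: Any = None) -> Any:
        """Return a feature value, or the default if this patient lacks the feature."""
        return self._patients[patient_id].feature_data.get(name, default)


# Two hypothetical patients with different, only partly overlapping features.
store = PatientDataStore()
store.add(PatientRecord("p1", feature_data={"smoking_status": "former"}))
store.add(PatientRecord("p2", feature_data={"tumor_mutational_burden": 4.2}))
```

Storing each patient's features as a mapping keeps absent features implicit, so one patient's record can hold a different set of features than another's without placeholder entries.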

In some embodiments, sequencing data 122 from one or more sequencing reactions 122-i, including a plurality of sequence reads 123-i-1 to 123-i-K, is stored in the test patient data store 120. The data store may include different sets of sequencing data from a single subject, corresponding to different samples from the patient, e.g., a tumor sample, liquid biopsy sample, tumor organoid derived from a patient tumor, and/or a normal sample, and/or to samples acquired at different times, e.g., while monitoring the progression, regression, remission, and/or recurrence of a cancer in a subject. The sequence reads may be in any suitable file format, e.g., BCL, FASTA, FASTQ, etc. In some embodiments, sequencing data 122 is accessed by a sequencing data processing module 141, which performs various pre-processing, genome alignment, and demultiplexing operations, as described in detail below with reference to bioinformatics module 140. In some embodiments, sequence data that has been aligned to a reference construct, e.g., BAM file 124, is stored in test patient data store 120.

In some embodiments, the test patient data store 120 includes feature data 125, e.g., that is useful for identifying clinical support for personalized cancer therapy. In some embodiments, the feature data 125 includes personal characteristics 126 of the patient, such as patient name, date of birth, gender, ethnicity, physical address, smoking status, alcohol consumption characteristic, anthropometric data, etc.

In some embodiments, the feature data 125 includes medical history data 127 for the patient, such as cancer diagnosis information (e.g., date of initial diagnosis, date of metastatic diagnosis, cancer staging, tumor characterization, tissue of origin, previous treatments and outcomes, adverse effects of therapy, therapy group history, clinical trial history, previous and current medications, surgical history, etc.), previous or current symptoms, previous or current therapies, previous treatment outcomes, previous disease diagnoses, diabetes status, diagnoses of depression, diagnoses of other physical or mental maladies, and family medical history. In some embodiments, the feature data 125 includes clinical features 128, such as pathology data 128-1, medical imaging data 128-2, and tissue culture and/or tissue organoid culture data 128-3.

In some embodiments, yet other clinical features, such as previous laboratory testing results, are stored in the test patient data store 120. Medical history data 127 and clinical features may be collected from various sources, including at intake directly from the patient, from an electronic medical record (EMR) or electronic health record (EHR) for the patient, or curated from other sources, such as fields from various testing records (e.g., genetic sequencing reports).

In some embodiments, the feature data 125 includes genomic features 131 for the patient. Non-limiting examples of genomic features include allelic states 132 (e.g., the identity of alleles at one or more loci, support for wild type or variant alleles at one or more loci, support for SNVs/MNVs at one or more loci, support for indels at one or more loci, and/or support for gene rearrangements at one or more loci), allelic fractions 133 (e.g., ratios of variant to reference alleles (or vice versa)), methylation states 134 (e.g., a distribution of methylation patterns at one or more loci and/or support for aberrant methylation patterns at one or more loci), genomic copy numbers 135 (e.g., a copy number value at one or more loci and/or support for an aberrant (increased or decreased) copy number at one or more loci), tumor mutational burden 136 (e.g., a measure of the number of mutations in the cancer genome of the subject), and microsatellite instability status 137 (e.g., a measure of the repeated unit length at one or more microsatellite loci and/or a classification of the MSI status for the patient's cancer). In some embodiments, one or more of the genomic features 131 are determined by a nucleic acid bioinformatics pipeline, e.g., as described in detail below with reference to FIGS. 4A-4F. In particular, in some embodiments, the feature data 125 include variant allele fractions 133, as determined using the improved methods for validating somatic sequence variants described in further detail below with reference to FIGS. 1C, 1D, and 4F. In some embodiments, one or more of the genomic features 131 are obtained from an external testing source, e.g., not connected to the bioinformatics pipeline as described below.

In some embodiments, the feature data 125 further includes data 138 from other -omics fields of study. Non-limiting examples of -omics fields of study that may yield feature data useful for providing clinical support for personalized cancer therapy include transcriptomics, epigenomics, proteomics, metabolomics, metabonomics, microbiomics, lipidomics, glycomics, cellomics, and organoidomics.

In some embodiments, yet other features may include features derived from machine learning approaches, e.g., based at least in part on evaluation of any relevant molecular or clinical features, considered alone or in combination, not limited to those listed above. For instance, in some embodiments, one or more latent features learned from evaluation of cancer patient training datasets improve the diagnostic and prognostic power of the various analysis algorithms in the feature analysis module 160.

The skilled artisan will know of other types of features useful for providing clinical support for personalized cancer therapy. The listing of features above is merely representative and should not be construed to be limiting.

In some embodiments, a test patient data store 120 includes clinical assessment data 139 for patients, e.g., based on the feature data 125 collected for the subject. In some embodiments, the clinical assessment data 139 includes a catalogue of actionable variants and characteristics 139-1 (e.g., genomic alterations and compound metrics based on genomic features known or believed to be targetable by one or more specific cancer therapies), matched therapies 139-2 (e.g., the therapies known or believed to be particularly beneficial for treatment of subjects having actionable variants), and/or clinical reports 139-3 generated for the subject, e.g., based on identified actionable variants and characteristics 139-1 and/or matched therapies 139-2.

In some embodiments, clinical assessment data 139 is generated by analysis of feature data 125 using the various algorithms of feature analysis module 160, as described in further detail below. In some embodiments, clinical assessment data 139 is generated, modified, and/or validated by evaluation of feature data 125 by a clinician, e.g., an oncologist. For instance, in some embodiments, a clinician (e.g., at clinical environment 220) uses feature analysis module 160, or accesses test patient data store 120 directly, to evaluate feature data 125 to make recommendations for personalized cancer treatment of a patient. Similarly, in some embodiments, a clinician (e.g., at clinical environment 220) reviews recommendations determined using feature analysis module 160 and approves, rejects, or modifies the recommendations, e.g., prior to the recommendations being sent to a medical professional treating the cancer patient.

Bioinformatics Module (140)

Referring again to FIG. 1A, the system (e.g., system 100) includes a bioinformatics module 140 that includes a feature extraction module 145 and optional ancillary data processing constructs, such as a sequence data processing module 141 and/or one or more reference sequence constructs 158 (e.g., a reference genome, exome, or targeted-panel construct that includes reference sequences for a plurality of loci targeted by a sequencing panel).

In some embodiments, bioinformatics module 140 includes a sequence data processing module 141 that includes instructions for processing sequence reads, e.g., raw sequence reads 123 from one or more sequencing reactions 122, prior to analysis by the various feature extraction algorithms, as described in detail below. In some embodiments, sequence data processing module 141 includes one or more pre-processing algorithms 142 that prepare the data for analysis. In some embodiments, the pre-processing algorithms 142 include instructions for converting the file format of the sequence reads from the output of the sequencer (e.g., a BCL file format) into a file format compatible with downstream analysis of the sequences (e.g., a FASTQ or FASTA file format). In some embodiments, the pre-processing algorithms 142 include instructions for evaluating the quality of the sequence reads (e.g., by interrogating quality metrics like Phred score, base-calling error probabilities, Quality (Q) scores, and the like) and/or removing sequence reads that do not satisfy a threshold quality (e.g., an inferred base call accuracy of at least 80%, at least 90%, at least 95%, at least 99%, at least 99.5%, at least 99.9%, or higher). In some embodiments, the pre-processing algorithms 142 include instructions for filtering the sequence reads for one or more properties, e.g., removing sequences failing to satisfy a lower or upper size threshold or removing duplicate sequence reads.
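By way of non-limiting illustration, the quality and duplicate filtering described above can be sketched as follows. The sketch assumes FASTQ quality strings in Phred+33 encoding, uses the standard relation between a Phred score Q and inferred base-call accuracy (1 - 10^(-Q/10)), and treats two reads as duplicates only when their sequences are identical. The function names, the use of a mean per-base accuracy, and the 99% default are illustrative assumptions and are not taken from the disclosure.

```python
from statistics import mean


def phred_to_accuracy(q: float) -> float:
    """Inferred base-call accuracy for Phred score Q: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def decode_quality(quality_string: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def filter_reads(reads, min_accuracy: float = 0.99):
    """Drop low-quality reads and exact duplicate sequences.

    reads: iterable of (read_id, sequence, quality_string) tuples.
    Returns the reads that pass, in their original order.
    """
    seen = set()
    kept = []
    for read_id, seq, qual in reads:
        accuracy = mean(phred_to_accuracy(q) for q in decode_quality(qual))
        if accuracy < min_accuracy:
            continue  # fails the (assumed) mean-accuracy threshold
        if seq in seen:
            continue  # simplification: duplicates detected by identical sequence only
        seen.add(seq)
        kept.append((read_id, seq, qual))
    return kept
```

A production pipeline would typically identify duplicates from alignment coordinates and molecular tags rather than raw sequence identity; the sequence-identity test here is only a stand-in.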

In some embodiments, sequence data processing module 141 includes one or more alignment algorithms 143, for aligning pre-processed sequence reads 123 to a reference sequence construct 158, e.g., a reference genome, exome, or targeted-panel construct. Many algorithms for aligning sequencing data to a reference construct are known in the art, for example, BWA, Blat, SHRiMP, LastZ, and MAQ. One example of a sequence read alignment package is the Burrows-Wheeler Alignment tool (BWA), which uses a Burrows-Wheeler Transform (BWT) to align short sequence reads against a large reference construct, allowing for mismatches and gaps. Li and Durbin, Bioinformatics, 25(14):1754-60 (2009), the content of which is incorporated herein by reference, in its entirety, for all purposes. Sequence read alignment packages import raw or pre-processed sequence reads 123, e.g., in BCL, FASTA, or FASTQ file formats, and output aligned sequence reads 124, e.g., in SAM or BAM file formats.

In some embodiments, sequence data processing module 141 includes one or more demultiplexing algorithms 144, for dividing sequence read or sequence alignment files generated from sequencing reactions of pooled nucleic acids into separate sequence read or sequence alignment files, each of which corresponds to a different source of nucleic acids in the nucleic acid sequencing pool. For instance, because of the cost of sequencing reactions, it is common practice to pool nucleic acids from a plurality of samples into a single sequencing reaction. The nucleic acids from each sample are tagged with a sample-specific and/or molecule-specific sequence tag (e.g., a UMI), which is sequenced along with the molecule. In some embodiments, demultiplexing algorithms 144 sort these sequence tags in the sequence read or sequence alignment files to demultiplex the sequencing data into separate files for each of the samples included in the sequencing reaction.
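As a non-limiting sketch of the demultiplexing step described above, the following groups pooled reads by a sample-specific tag. It assumes, purely for illustration, that the tag occupies a fixed number of bases at the start of each read and must match a known tag exactly; the function name, the tag layout, and the "undetermined" bucket are hypothetical and not taken from the disclosure.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample, tag_length):
    """Split pooled reads into per-sample lists using a leading sequence tag.

    reads: iterable of (read_id, sequence) tuples from one pooled reaction.
    tag_to_sample: dict mapping a tag sequence to a sample name.
    tag_length: number of leading bases that make up the tag (assumed fixed).
    """
    by_sample = defaultdict(list)
    for read_id, seq in reads:
        tag, insert = seq[:tag_length], seq[tag_length:]
        # Reads whose tag is not recognized are kept apart rather than discarded.
        sample = tag_to_sample.get(tag, "undetermined")
        by_sample[sample].append((read_id, insert))
    return dict(by_sample)
```

Real demultiplexers usually tolerate a small number of mismatches in the tag; exact matching is used here only to keep the sketch short.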

Bioinformatics module 140 includes a feature extraction module 145, which includes instructions for identifying diagnostic features, e.g., genomic features 131, from sequencing data 122 of biological samples from a subject, e.g., one or more of a solid tumor sample, a liquid biopsy sample, or a normal tissue (e.g., control) sample. For instance, in some embodiments, a feature extraction algorithm compares the identity of one or more nucleotides at a locus from the sequencing data 122 to the identity of the nucleotides at that locus in a reference sequence construct (e.g., a reference genome, exome, or targeted-panel construct) to determine whether the subject has a variant at that locus. In some embodiments, a feature extraction algorithm evaluates data other than the raw sequence, to identify a genomic alteration in the subject, e.g., an allelic ratio, a relative copy number, a repeat unit distribution, etc.
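The locus-level comparison described above can be illustrated, in a non-limiting way, by tallying the bases that aligned reads place at a position and reporting those that differ from the reference. The sketch assumes ungapped alignments represented as (0-based start, sequence) pairs, which is a simplification of real SAM/BAM records; all names are illustrative.

```python
from collections import Counter


def allele_counts_at_locus(aligned_reads, position):
    """Count the bases observed at a reference position.

    aligned_reads: iterable of (start, sequence) pairs, where start is the
    0-based reference coordinate of the first base (ungapped alignment assumed).
    """
    counts = Counter()
    for start, seq in aligned_reads:
        offset = position - start
        if 0 <= offset < len(seq):  # read covers the locus
            counts[seq[offset]] += 1
    return counts


def candidate_variants(reference, aligned_reads, position):
    """Return {alt_base: (alt_count, depth)} for non-reference bases at a locus."""
    ref_base = reference[position]
    counts = allele_counts_at_locus(aligned_reads, position)
    depth = sum(counts.values())
    return {base: (n, depth) for base, n in counts.items() if base != ref_base}
```

The pair returned for each alternate base corresponds loosely to a variant allele count and a locus count, from which an allelic ratio can be computed.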

For instance, in some embodiments, feature extraction module 145 includes one or more variant identification modules that include instructions for various variant calling processes. In some embodiments, variants in the germline of the subject are identified, e.g., using a germline variant identification module 146. In some embodiments, variants in the cancer genome, e.g., somatic variants, are identified, e.g., using a somatic variant identification module 150. While separate germline and somatic variant identification modules are illustrated in FIG. 1A, in some embodiments they are integrated into a single module. In some embodiments, the variant identification module includes instructions for identifying one or more of nucleotide variants (e.g., single nucleotide variants (SNV) and multi-nucleotide variants (MNV)) using one or more SNV/MNV calling algorithms (e.g., algorithms 147 and/or 151), indels (e.g., insertions or deletions of nucleotides) using one or more indel calling algorithms (e.g., algorithms 148 and/or 152), and genomic rearrangements (e.g., inversions, translocations, and fusions of nucleotide sequences) using one or more genomic rearrangement calling algorithms (e.g., algorithms 149 and/or 153).

For example, referring to FIGS. 1C and 1D, in some embodiments, feature extraction module 145 comprises, in the variant identification module 146, a variant thresholding module 146-a, a sequence variant data store 146-r, and a variant validation module 146-o. In some such embodiments, the sequence variant data store 146-r comprises one or more candidate variants for a test subject identified by aligning to a reference sequence a plurality of sequence reads obtained from sequencing a liquid biopsy sample of the test subject, the one or more candidate variants corresponding to a respective one or more loci in the reference sequence. The plurality of sequence reads aligned to the reference sequence is used to identify a variant allele fragment count for each candidate variant. The sequence variant data store 146-r further comprises, in some embodiments, a plurality of variants from a first set of nucleic acids obtained from a cohort of subjects (e.g., from a tumor tissue biopsy for each subject in a baseline cohort of subjects). The variant thresholding module 146-a performs a function for each candidate variant in the one or more candidate variants where, for each corresponding locus 146-b (e.g., 146-b-1, . . . , 146-b-P), a dynamic variant count threshold 146-d (e.g., 146-d-1) is obtained based on a pre-test odds of a positive variant call for the locus, based on the prevalence of variants in the genomic region that includes the locus, using the plurality of variants for the baseline cohort. The variant thresholding module 146-a compares the variant allele fragment count 146-c (e.g., 146-c-1) for the candidate variant against the dynamic variant count threshold 146-d for the locus corresponding to the candidate variant. In some embodiments, the variant validation module 146-o determines whether the candidate variant is validated or rejected as a somatic sequence variant based on the comparison. For example, when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, the somatic sequence variant is validated, and when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus, the somatic sequence variant is rejected.
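The comparison performed by the variant thresholding and validation modules can be sketched, without limitation, as follows. This passage does not specify how the pre-test odds are converted into a count threshold, so the sketch substitutes a simple hypothetical rule: prevalence p in the cohort gives pre-test odds p / (1 - p), each supporting fragment is assumed to multiply the odds by a fixed likelihood ratio, and the threshold is the smallest fragment count whose post-test odds reach a target. The likelihood ratio, the target odds, the function names, and the reading of "satisfies" as "greater than or equal to" are all assumptions made for illustration.

```python
import math


def pretest_odds(prevalence: float) -> float:
    """Convert a cohort prevalence (0 < p < 1) into odds p / (1 - p)."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return prevalence / (1.0 - prevalence)


def dynamic_threshold(prevalence: float,
                      per_fragment_lr: float = 10.0,   # hypothetical value
                      target_posttest_odds: float = 99.0,  # hypothetical value
                      min_count: int = 1) -> int:
    """Smallest count k with pretest_odds * per_fragment_lr**k >= target odds."""
    odds = pretest_odds(prevalence)
    needed = math.log(target_posttest_odds / odds) / math.log(per_fragment_lr)
    return max(min_count, math.ceil(needed))


def validate(variant_fragment_count: int, threshold: int) -> bool:
    """Validate the candidate when its fragment count meets the locus threshold."""
    return variant_fragment_count >= threshold
```

Under this hypothetical rule, a locus in a region where variants are common in the cohort receives a lower threshold than a locus in a region where they are rare, so the same fragment count may validate a variant at one locus and be rejected at another.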

In some embodiments, the dynamic variant count threshold is determined based on a distribution of variant detection sensitivities as a function of circulating variant allele fraction from the cohort of subjects (e.g., the baseline cohort). For example, referring to FIG. 1C, in some such embodiments, the variant thresholding module 146-a takes as input one or more variant allele fractions 133 from the genomic features module 131. In some such embodiments, the variant allele fractions 133 comprise a plurality of variant allele fractions obtained from tumor tissue biopsies 133-t (e.g., 133-t-1, 133-t-2 . . . , 133-t-O) for the cohort of subjects. In some embodiments, the variant allele fractions comprise a plurality of variant allele fractions obtained from liquid biopsy samples 133-cf (e.g., 133-cf-1, 133-cf-2 . . . , 133-cf-N) for the cohort of subjects. In some embodiments, the circulating variant allele fraction is obtained by comparing the liquid biopsy variant allele fractions 133-cf to the tumor biopsy variant allele fractions 133-t.

Additional embodiments for using variant allele fractions (e.g., variant allele frequencies) to identify somatic variants are detailed below (see, Example Methods: Variant Identification).

A SNV/MNV algorithm 147 may identify a substitution of a single nucleotide that occurs at a specific position in the genome. For example, at a specific base position, or locus, in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position and the two possible nucleotide variations, C or A, are said to be alleles for this position. SNPs underlie differences in human susceptibility to a wide range of diseases (e.g., sickle-cell anemia, β-thalassemia, and cystic fibrosis result from SNPs). The severity of illness and the way the body responds to treatments are also manifestations of genetic variations. For example, a single-base mutation in the APOE (apolipoprotein E) gene is associated with a lower risk for Alzheimer's disease. A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single-nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration. An MNP (multiple-nucleotide polymorphism) module may identify the substitution of consecutive nucleotides at a specific position in the genome.

An indel calling algorithm 148 may identify an insertion or deletion of bases in the genome of an organism classified among small genetic variations. While indels usually measure from 1 to 10,000 base pairs in length, a microindel is defined as an indel that results in a net change of 1 to 50 nucleotides. Indels can be contrasted with a SNP or point mutation. An indel inserts and/or deletes nucleotides from a sequence, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number in the DNA. Indels, being insertions and/or deletions, can be used as genetic markers in natural populations, especially in phylogenetic studies. Indel frequency tends to be markedly lower than that of single nucleotide polymorphisms (SNP), except near highly repetitive regions, including homopolymers and microsatellites.
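The distinctions drawn above between single-nucleotide substitutions, multi-nucleotide substitutions, and indels can be illustrated with a small classifier over reference and alternate allele strings. The sketch assumes the alleles have already been trimmed to their minimal representation (so that equal-length alleles differ at every position of interest) and applies the 1 to 50 nucleotide net-change definition of a microindel given above; the function name and return labels are illustrative only.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a (ref, alt) allele pair as SNV, MNV, microindel, or indel.

    Assumes ref and alt are normalized allele strings; an empty string may be
    used for a pure insertion or deletion.
    """
    if ref == alt:
        return "no_change"
    if len(ref) == len(alt):
        # Substitution: the number of nucleotides is unchanged.
        return "SNV" if len(ref) == 1 else "MNV"
    # Insertion and/or deletion: the number of nucleotides changes.
    net_change = abs(len(alt) - len(ref))
    return "microindel" if net_change <= 50 else "indel"
```

For example, `classify_variant("C", "A")` falls in the substitution branch, while `classify_variant("A", "ATG")` has a net change of two nucleotides and falls in the microindel branch.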

A genomic rearrangement algorithm 149 may identify hybrid genes formed from two previously separate genes. Gene fusion can occur as a result of translocation, interstitial deletion, or chromosomal inversion. Gene fusion can play an important role in tumorigenesis. Fusion genes can contribute to tumor formation because fusion genes can produce much more active abnormal protein than non-fusion genes. Often, fusion genes are oncogenes that cause cancer; these include BCR-ABL, TEL-AML1 (ALL with t(12; 21)), AML1-ETO (M2 AML with t(8; 21)), and TMPRSS2-ERG with an interstitial deletion on chromosome 21, often occurring in prostate cancer. In the case of TMPRSS2-ERG, by disrupting androgen receptor (AR) signaling and inhibiting AR expression by an oncogenic ETS transcription factor, the fusion product regulates prostate cancer. Most fusion genes are found in hematological cancers, sarcomas, and prostate cancer. BCAM-AKT2 is a fusion gene that is specific and unique to high-grade serous ovarian cancer. Oncogenic fusion genes may lead to a gene product with a new or different function from the two fusion partners. Alternatively, a proto-oncogene is fused to a strong promoter, and thereby the oncogenic function is set to function by an upregulation caused by the strong promoter of the upstream fusion partner. The latter is common in lymphomas, where oncogenes are juxtaposed to the promoters of the immunoglobulin genes. Oncogenic fusion transcripts may also be caused by trans-splicing or read-through events. Since chromosomal translocations play such a significant role in neoplasia, a specialized database of chromosomal aberrations and gene fusions in cancer has been created. This database is called the Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer.

In some embodiments, feature extraction module 145 includes instructions for identifying one or more complex genomic alterations (e.g., features that incorporate more than a change in the primary sequence of the genome) in the cancer genome of the subject. For instance, in some embodiments, feature extraction module 145 includes modules for identifying one or more of copy number variation (e.g., copy number variation analysis module 153), microsatellite instability status (e.g., microsatellite instability analysis module 154), tumor mutational burden (e.g., tumor mutational burden analysis module 155), tumor ploidy (e.g., tumor ploidy analysis module 156), and homologous recombination pathway deficiencies (e.g., homologous recombination pathway analysis module 157).

Feature Analysis Module (160)

Referring again to FIG. 1A, the system (e.g., system 100) includes a feature analysis module 160 that includes one or more genomic alteration interpretation algorithms 161, one or more optional clinical data analysis algorithms 165, an optional therapeutic curation algorithm 166, and an optional recommendation validation module 167. In some embodiments, feature analysis module 160 identifies actionable variants and characteristics 139-1 and corresponding matched therapies 139-2 and/or clinical trials using one or more analysis algorithms (e.g., algorithms 162, 163, 164, and 165) to evaluate feature data 125. The identified actionable variants and characteristics 139-1 and corresponding matched therapies 139-2, which are optionally stored in test patient data store 120, are then curated by feature analysis module 160 to generate a clinical report 139-3, which is optionally validated by a user, e.g., a clinician, before being transmitted to a medical professional, e.g., an oncologist, treating the patient.

In some embodiments, the genomic alteration interpretation algorithms 161 include instructions for evaluating the effect that one or more genomic features 131 of the subject, e.g., as identified by feature extraction module 145, have on the characteristics of the patient's cancer and/or whether one or more targeted cancer therapies may improve the clinical outcome for the patient. For example, in some embodiments, one or more genomic variant analysis algorithms 163 evaluate various genomic features 131 by querying a database, e.g., a look-up-table (“LUT”) of actionable genomic alterations, targeted therapies associated with the actionable genomic alterations, and any other conditions that should be met before administering the targeted therapy to a subject having the actionable genomic alteration. For instance, evidence suggests that depatuxizumab mafodotin (an anti-EGFR mAb conjugated to monomethyl auristatin F) has improved efficacy for the treatment of recurrent glioblastomas having EGFR focal amplifications. van den Bent M. et al., Cancer Chemother Pharmacol., 80(6):1209-17 (2017). Accordingly, the actionable genomic alteration LUT would have an entry for the focal amplification of the EGFR gene indicating that depatuxizumab mafodotin is a targeted therapy for glioblastomas (e.g., recurrent glioblastomas) having a focal gene amplification. In some instances, the LUT may also include counter-indications for the associated targeted therapy, e.g., adverse drug interactions or personal characteristics that are counter-indicated for administration of the particular targeted therapy.
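A look-up of the kind described above can be sketched, in a non-limiting way, as a dictionary keyed by (gene, alteration type). The EGFR entry mirrors the example given in the text; the second entry, the counter-indication flag, and all class, field, and function names are hypothetical placeholders introduced only to exercise the counter-indication check, and do not describe any real therapy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LutEntry:
    therapy: str
    cancer_types: frozenset
    counter_indications: frozenset = frozenset()


# Illustrative table; only the EGFR entry is drawn from the text above.
ACTIONABLE_LUT = {
    ("EGFR", "focal_amplification"): [
        LutEntry("depatuxizumab mafodotin", frozenset({"glioblastoma"})),
    ],
    ("GENE_X", "hypothetical_alteration"): [  # placeholder entry
        LutEntry("hypothetical_drug", frozenset({"glioblastoma"}),
                 frozenset({"hypothetical_flag"})),
    ],
}


def match_therapies(alterations, cancer_type, patient_flags=frozenset()):
    """Return (alteration, therapy) pairs whose LUT conditions are all met."""
    matches = []
    for key in alterations:
        for entry in ACTIONABLE_LUT.get(key, []):
            if cancer_type not in entry.cancer_types:
                continue  # therapy is not listed for this cancer type
            if entry.counter_indications & set(patient_flags):
                continue  # a counter-indication applies to this patient
            matches.append((key, entry.therapy))
    return matches
```

Keeping the counter-indications alongside each therapy entry lets a single query return only those therapies whose listed conditions are satisfied for the patient.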

In some embodiments, a genomic alteration interpretation algorithm 161 determines whether a particular genomic feature 131 should be reported to a medical professional treating the cancer patient. In some embodiments, genomic features 131 (e.g., genomic alterations and compound features) are reported when there is clinical evidence that the feature significantly impacts the biology of the cancer, impacts the prognosis for the cancer, and/or impacts pharmacogenomics, e.g., by indicating or counter-indicating particular therapeutic approaches. For instance, a genomic alteration interpretation algorithm 161 may classify a particular CNV feature 135 as “Reportable,” e.g., meaning that the CNV has been identified as influencing the character of the cancer, the overall disease state, and/or pharmacogenomics, as “Not Reportable,” e.g., meaning that the CNV has not been identified as influencing the character of the cancer, the overall disease state, and/or pharmacogenomics, as “No Evidence,” e.g., meaning that no evidence exists supporting that the CNV is “Reportable” or “Not Reportable,” or as “Conflicting Evidence,” e.g., meaning that evidence exists supporting both that the CNV is “Reportable” and that the CNV is “Not Reportable.”
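The four-way classification described above reduces to a small decision table over two kinds of evidence. The following non-limiting sketch assumes the evidence has already been summarized as two counts, one for evidence supporting "Reportable" and one for evidence supporting "Not Reportable"; how such evidence would be gathered and counted is not addressed here, and the function name is illustrative.

```python
def classify_reportability(supporting_reportable: int,
                           supporting_not_reportable: int) -> str:
    """Map evidence counts to one of the four reporting classes."""
    if supporting_reportable > 0 and supporting_not_reportable > 0:
        return "Conflicting Evidence"
    if supporting_reportable > 0:
        return "Reportable"
    if supporting_not_reportable > 0:
        return "Not Reportable"
    return "No Evidence"
```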

In some embodiments, the genomic alteration interpretation algorithms 161 include one or more pathogenic variant analysis algorithms 162, which evaluate various genomic features to identify the presence of an oncogenic pathogen associated with the patient's cancer and/or targeted therapies associated with an oncogenic pathogen infection in the cancer. For instance, RNA expression patterns of some cancers are associated with the presence of an oncogenic pathogen that is helping to drive the cancer. See, for example, U.S. patent application Ser. No. 16/802,126, filed Feb. 26, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some instances, the recommended therapy for the cancer is different when the cancer is associated with the oncogenic pathogen infection than when it is not. Accordingly, in some embodiments, e.g., where feature data 125 includes RNA abundance data for the cancer of the patient, one or more pathogenic variant analysis algorithms 162 evaluate the RNA abundance data for the patient's cancer to determine whether a signature exists in the data that indicates the presence of the oncogenic pathogen in the cancer. Similarly, in some embodiments, bioinformatics module 140 includes an algorithm that searches for the presence of pathogenic nucleic acid sequences in sequencing data 122. See, for example, U.S. Provisional Patent Application Ser. No. 62/978,067, filed Feb. 18, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes. Accordingly, in some embodiments, one or more pathogenic variant analysis algorithms 162 evaluate whether the presence of an oncogenic pathogen in a subject is associated with an actionable therapy for the infection. In some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”), of actionable oncogenic pathogen infections, targeted therapies associated with the actionable infections, and any other conditions that should be met before administering the targeted therapy to a subject that is infected with the oncogenic pathogen. In some instances, the LUT may also include counter-indications for the associated targeted therapy, e.g., adverse drug interactions or personal characteristics that are counter-indicated for administration of the particular targeted therapy.

In some embodiments, the genomic alteration interpretation algorithms 161 include one or more multi-feature analysis algorithms 164 that evaluate a plurality of features to classify a cancer with respect to the effects of one or more targeted therapies. For instance, in some embodiments, feature analysis module 160 includes one or more classifiers trained against feature data, one or more clinical therapies, and their associated clinical outcomes for a plurality of training subjects to classify cancers based on their predicted clinical outcomes following one or more therapies.

In some embodiments, the classifier is implemented as an artificial intelligence engine and may include gradient boosting models, random forest models, neural networks (NN), regression models, Naive Bayes models, and/or machine learning algorithms (MLA). An MLA or a NN may be trained from a training data set that includes one or more features 125, including personal characteristics 126, medical history 127, clinical features 128, genomic features 131, and/or other -omic features 138. MLAs include supervised algorithms (such as algorithms where the features/classifications in the data set are annotated) using linear regression, logistic regression, decision trees, classification and regression trees, naïve Bayes, nearest neighbor clustering; unsupervised algorithms (such as algorithms where no features/classification in the data set are annotated) using Apriori, k-means clustering, principal component analysis, random forest, adaptive boosting; and semi-supervised algorithms (such as algorithms where an incomplete number of features/classifications in the data set are annotated) using generative approaches (such as a mixture of Gaussian distributions, mixture of multinomial distributions, hidden Markov models), low density separation, graph-based approaches (such as mincut, harmonic function, manifold regularization), heuristic approaches, or support vector machines.

NNs include conditional random fields, convolutional neural networks, attention based neural networks, deep learning, long short term memory networks, or other neural models where the training data set includes a plurality of tumor samples, RNA expression data for each sample, and pathology reports covering imaging data for each sample.

While MLA and neural networks identify distinct approaches to machine learning, the terms may be used interchangeably herein. Thus, a mention of MLA may include a corresponding NN or a mention of NN may include a corresponding MLA unless explicitly stated otherwise. Training may include providing optimized datasets, labeling these traits as they occur in patient records, and training the MLA to predict or classify based on new inputs. Artificial NNs are efficient computing models which have shown their strengths in solving hard problems in artificial intelligence. They have also been shown to be universal approximators, that is, they can represent a wide variety of functions when given appropriate parameters.

In some embodiments, system 100 includes a classifier training module that includes instructions for training one or more untrained or partially trained classifiers based on feature data from a training dataset. In some embodiments, system 100 also includes a database of training data for use in training the one or more classifiers. In other embodiments, the classifier training module accesses a remote storage device hosting training data. In some embodiments, the training data includes a set of training features, including but not limited to, various types of the feature data 125 illustrated in FIG. 1B. In some embodiments, the classifier training module uses patient data 121, e.g., when test patient data store 120 also stores a record of treatments administered to the patient and patient outcomes following therapy.

In some embodiments, feature analysis module 160 includes one or more clinical data analysis algorithms 165, which evaluate clinical features 128 of a cancer to identify targeted therapies which may benefit the subject. For example, in some embodiments, e.g., where feature data 125 includes pathology data 128-1, one or more clinical data analysis algorithms 165 evaluate the data to determine whether an actionable therapy is indicated based on the histopathology of a tumor biopsy from the subject, e.g., which is indicative of a particular cancer type and/or stage of cancer. In some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”), of actionable clinical features (e.g., pathology features), targeted therapies associated with the actionable features, and any other conditions that should be met before administering the targeted therapy to a subject associated with the actionable clinical features 128 (e.g., pathology features 128-1). In some embodiments, system 100 evaluates the clinical features 128 (e.g., pathology features 128-1) directly to determine whether the patient's cancer is sensitive to a particular therapeutic agent. Further details on example methods, systems, and algorithms for classifying cancer and identifying targeted therapies based on clinical data, such as pathology data 128-1, imaging data 128-2, and/or tissue culture/organoid data 128-3 are discussed, for example, in U.S. patent application Ser. No. 16/830,186, filed on Mar. 25, 2020, U.S. patent application Ser. No. 16/789,363, filed on Feb. 12, 2020, and U.S. Provisional Application No. 63/007,874, filed on Apr. 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.

In some embodiments, feature analysis module 160 includes a clinical trials module that evaluates test patient data 121 to determine whether the patient is eligible for inclusion in a clinical trial for a cancer therapy, e.g., a clinical trial that is currently recruiting patients, a clinical trial that has not yet begun recruiting patients, and/or an ongoing clinical trial that may recruit additional patients in the future. In some embodiments, a clinical trial module evaluates test patient data 121 to determine whether the results of a clinical trial are relevant for the patient, e.g., the results of an ongoing clinical trial and/or the results of a completed clinical trial. For instance, in some embodiments, system 100 queries a database, e.g., a look-up-table (“LUT”) of clinical trials, e.g., active and/or completed clinical trials, and compares patient data 121 with inclusion criteria for the clinical trials, stored in the database, to identify clinical trials with inclusion criteria that closely match and/or exactly match the patient's data 121. In some embodiments, a record of matching clinical trials, e.g., those clinical trials that the patient may be eligible for and/or that may inform personalized treatment decisions for the patient, is stored in clinical assessment database 139.
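The comparison of patient data 121 against stored inclusion criteria can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the `Trial` record, the field names, and the `min_fraction` parameter are hypothetical, and exact equality of field values stands in for whatever matching logic a real system would apply.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    """Hypothetical look-up-table record for one clinical trial."""
    trial_id: str
    status: str  # e.g. "recruiting", "not_yet_recruiting", "completed"
    criteria: dict = field(default_factory=dict)  # field name -> required value


def match_trials(patient, trials, min_fraction=1.0):
    """Return (trial_id, fraction of criteria met), best matches first.

    min_fraction=1.0 keeps only exact matches; a lower value also keeps
    trials whose criteria are closely, but not exactly, matched.
    """
    matches = []
    for trial in trials:
        if not trial.criteria:
            continue  # nothing to compare against
        met = sum(1 for key, value in trial.criteria.items()
                  if patient.get(key) == value)
        fraction = met / len(trial.criteria)
        if fraction >= min_fraction:
            matches.append((trial.trial_id, fraction))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical patient record and trial table.
patient = {"cancer_type": "breast", "stage": "IV", "variant": "PIK3CA H1047R"}
trials = [
    Trial("T-001", "recruiting",
          {"cancer_type": "breast", "variant": "PIK3CA H1047R"}),
    Trial("T-002", "completed", {"cancer_type": "lung"}),
    Trial("T-003", "recruiting", {"cancer_type": "breast", "stage": "III"}),
]
exact = match_trials(patient, trials)
close = match_trials(patient, trials, min_fraction=0.5)
```

Here `exact` contains only the trial whose criteria are all met, while `close` also includes the trial for which half of the criteria are met.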

In some embodiments, feature analysis module 160 includes a therapeuticcuration algorithm 166 that assembles actionable variants andcharacteristics 139-1, matched therapies 139-2, and/or relevant clinicaltrials identified for the patient, as described above. In someembodiments, a therapeutic curation algorithm 166 evaluates certaincriteria related to which actionable variants and characteristics 139-1,matched therapies 139-2, and/or relevant clinical trials should bereported and/or whether certain matched therapies, considered alone orin combination, may be counter-indicated for the patient, e.g., based onpersonal characteristics 126 of the patient and/or known drug-druginteractions. In some embodiments, the therapeutic curation algorithmthen generates one or more clinical reports 139-3 for the patient. Insome embodiments, the therapeutic curation algorithm generates a firstclinical report 139-3-1 that is to be reported to a medical professionaltreating the patient and a second clinical report 139-3-2 that will notbe communicated to the medical professional, but may be used to improvevarious algorithms within the system.

In some embodiments, feature analysis module 160 includes a recommendation validation module 167 that includes an interface allowing a clinician to review, modify, and approve a clinical report 139-3 prior to the report being sent to a medical professional, e.g., an oncologist, treating the patient.

In some embodiments, each of the one or more feature collections, sequencing modules, bioinformatics modules (including, e.g., alteration module(s), structural variant calling and data processing modules), classification modules, and outcome modules is communicatively coupled to a data bus to transfer data between modules for processing and/or storage. In some alternative embodiments, the feature collection, alteration module(s), structural variant and feature store are communicatively coupled to each other for independent communication without sharing the data bus.

Further details on systems and exemplary embodiments of modules and feature collections are discussed in PCT Application PCT/US19/69149, titled “A METHOD AND PROCESS FOR PREDICTING AND ANALYZING PATIENT COHORT RESPONSE, PROGRESSION, AND SURVIVAL,” filed Dec. 31, 2019, which is hereby incorporated herein by reference in its entirety.

Example Methods

Now that details of a system 100 for providing clinical support forpersonalized cancer therapy, e.g., with improved validation of somaticsequence variants, have been disclosed, details regarding processes andfeatures of the system, in accordance with various embodiments of thepresent disclosure, are disclosed below. Specifically, example processesare described below with reference to FIGS. 2A, 3, 4A-4F, 5A-5B, 6, and7. In some embodiments, such processes and features of the system arecarried out by modules 118, 120, 140, 160, and/or 170, as illustrated inFIG. 1A. Referring to these methods, the systems described herein (e.g.,system 100) include instructions for validating somatic variants thatare improved compared to conventional methods for somatic variantdetection.

FIG. 2B: Distributed Diagnostic and Clinical Environment

In some aspects, the methods described herein for providing clinicalsupport for personalized cancer therapy are performed across adistributed diagnostic/clinical environment, e.g., as illustrated inFIG. 2B. However, in some embodiments, the improved methods describedherein for validating somatic sequence variants are performed at asingle location, e.g., at a single computing system or environment,although ancillary procedures supporting the methods described herein,and/or procedures that make further use of the results of the methodsdescribed herein, may be performed across a distributeddiagnostic/clinical environment.

FIG. 2B illustrates an example of a distributed diagnostic/clinical environment 210. In some embodiments, the distributed diagnostic/clinical environment is connected via communication network 105. In some embodiments, one or more biological samples, e.g., one or more liquid biopsy samples, solid tumor biopsies, normal tissue samples, and/or control samples, are collected from a subject in clinical environment 220, e.g., a doctor's office, hospital, or medical clinic, or at a home health care environment (not depicted). Advantageously, while solid tumor samples should be collected within a clinical setting, liquid biopsy samples can be acquired in a less invasive fashion and are more easily collected outside of a traditional clinical setting. In some embodiments, one or more biological samples, or portions thereof, are processed within the clinical environment 220 where collection occurred, using a processing device 224, e.g., a nucleic acid sequencer for obtaining sequencing data, a microscope for obtaining pathology data, a mass spectrometer for obtaining proteomic data, etc. In some embodiments, one or more biological samples, or portions thereof, are sent to one or more external environments, e.g., sequencing lab 230, pathology lab 240, and/or molecular biology lab 250, each of which includes a processing device 234, 244, and 254, respectively, to generate biological data 121 for the subject. Each environment includes a communications device 222, 232, 242, and 252, respectively, for communicating biological data 121 about the subject to a processing server 262 and/or database 264, which may be located in yet another environment, e.g., processing/storage center 260. Thus, in some embodiments, different portions of the systems and methods described herein are fulfilled by different processing devices located in different physical environments.

Accordingly, in some embodiments, a method for providing clinicalsupport for personalized cancer therapy, e.g., with improved validationof somatic sequence variants, is performed across one or moreenvironments, as illustrated in FIG. 2B. For instance, in some suchembodiments, a liquid biopsy sample is collected at clinical environment220 or in a home healthcare environment. The sample, or a portionthereof, is sent to sequencing lab 230 where raw sequence reads 123 ofnucleic acids in the sample are generated by sequencer 234. The rawsequencing data 123 is communicated, e.g., from communications device232, to database 264 at processing/storage center 260, where processingserver 262 extracts features from the sequence reads by executing one ormore of the processes in bioinformatics module 140, thereby generatinggenomic features 131 for the sample. Processing server 262 may thenanalyze the identified features by executing one or more of theprocesses in feature analysis module 160, thereby generating clinicalassessment 139, including a clinical report 139-3. A clinician mayaccess clinical report 139-3, e.g., at processing/storage center 260 orthrough communications network 105, via recommendation validation module167. After final approval, clinical report 139-3 is transmitted to amedical professional, e.g., an oncologist, at clinical environment 220,who uses the report to support clinical decision making for personalizedtreatment of the patient's cancer.

FIG. 2A: Example Workflow for Precision Oncology

FIG. 2A is a flowchart of an example workflow 200 for collecting andanalyzing data in order to generate a clinical report 139 to supportclinical decision making in precision oncology. Advantageously, themethods described herein improve this process, for example, by improvingvarious stages within feature extraction 206, including validation ofsomatic sequence variants.

Briefly, the workflow begins with patient intake and sample collection 201, where one or more liquid biopsy samples, one or more tumor biopsies, and one or more normal and/or control tissue samples are collected from the patient (e.g., at a clinical environment 220 or home healthcare environment, as illustrated in FIG. 2B). In some embodiments, personal data 126 corresponding to the patient and a record of the one or more biological samples obtained (e.g., patient identifiers, patient clinical data, sample type, sample identifiers, cancer conditions, etc.) are entered into a data analysis platform, e.g., test patient data store 120. Accordingly, in some embodiments, the methods disclosed herein include obtaining one or more biological samples from one or more subjects, e.g., cancer patients. In some embodiments, the subject is a human, e.g., a human cancer patient.

In some embodiments, one or more of the biological samples obtained from the patient is a biological liquid sample, also referred to as a liquid biopsy sample. In some embodiments, one or more of the biological samples obtained from the patient are selected from blood, plasma, serum, urine, vaginal fluid, fluid from a hydrocele (e.g., of the testis), vaginal flushing fluids, pleural fluid, ascitic fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, discharge fluid from the nipple, aspiration fluid from different parts of the body (e.g., thyroid, breast), etc. In some embodiments, the liquid biopsy sample includes blood and/or saliva. In some embodiments, the liquid biopsy sample is peripheral blood. In some embodiments, blood samples are collected from patients in commercial blood collection containers, e.g., using PAXgene® Blood DNA Tubes. In some embodiments, saliva samples are collected from patients in commercial saliva collection containers, e.g., using an Oragene® DNA Saliva Kit.

In some embodiments, the liquid biopsy sample has a volume of from about1 mL to about 50 mL. For example, in some embodiments, the liquid biopsysample has a volume of about 1 mL, about 2 mL, about 3 mL, about 4 mL,about 5 mL, about 6 mL, about 7 mL, about 8 mL, about 9 mL, about 10 mL,about 11 mL, about 12 mL, about 13 mL, about 14 mL, about 15 mL, about16 mL, about 17 mL, about 18 mL, about 19 mL, about 20 mL, or greater.

Liquid biopsy samples include cell-free nucleic acids, including cell-free DNA (cfDNA). As described above, cfDNA isolated from cancer patients includes DNA originating from cancerous cells, also referred to as circulating tumor DNA (ctDNA), cfDNA originating from germline (e.g., healthy or non-cancerous) cells, and cfDNA originating from hematopoietic cells (e.g., white blood cells). The relative proportions of cancerous and non-cancerous cfDNA present in a liquid biopsy sample vary depending on the characteristics (e.g., the type, stage, lineage, genomic profile, etc.) of the patient's cancer. As used herein, the ‘tumor burden’ of the subject refers to the percentage of cfDNA that originated from cancerous cells.

As described herein, cfDNA is a particularly useful source of biologicaldata for various implementations of the methods and systems describedherein, because it is readily obtained from various body fluids.Advantageously, use of bodily fluids facilitates serial monitoringbecause of the ease of collection, as these fluids are collectable bynon-invasive or minimally invasive methodologies. This is in contrast tomethods that rely upon solid tissue samples, such as biopsies, whichoften times require invasive surgical procedures. Further, becausebodily fluids, such as blood, circulate throughout the body, the cfDNApopulation represents a sampling of many different tissue types frommany different locations.

In some embodiments, a liquid biopsy sample is separated into two different samples. For example, in some embodiments, a blood sample is separated into a blood plasma sample, containing cfDNA, and a buffy coat preparation, containing white blood cells.

In some embodiments, a plurality of liquid biopsy samples is obtainedfrom a respective subject at intervals over a period of time (e.g.,using serial testing). For example, in some such embodiments, the timebetween obtaining liquid biopsy samples from a respective subject is atleast 1 day, at least 2 days, at least 1 week, at least 2 weeks, atleast 1 month, at least 2 months, at least 3 months, at least 4 months,at least 6 months, or at least 1 year.

In some embodiments, one or more biological samples collected from the patient is a solid tissue sample, e.g., a solid tumor sample or a solid normal tissue sample. Methods for obtaining solid tissue samples, e.g., of cancerous and/or normal tissue, are known in the art and are dependent upon the type of tissue being sampled. For example, bone marrow biopsies and isolation of circulating tumor cells can be used to obtain samples of blood cancers; endoscopic biopsies can be used to obtain samples of cancers of the digestive tract, bladder, and lungs; needle biopsies (e.g., fine-needle aspiration, core needle aspiration, vacuum-assisted biopsy, and image-guided biopsy) can be used to obtain samples of subdermal tumors; skin biopsies, e.g., shave biopsy, punch biopsy, incisional biopsy, and excisional biopsy, can be used to obtain samples of dermal cancers; and surgical biopsies can be used to obtain samples of cancers affecting internal organs of a patient. In some embodiments, a solid tissue sample is a formalin-fixed tissue (FFT). In some embodiments, a solid tissue sample is a macro-dissected formalin-fixed paraffin-embedded (FFPE) tissue. In some embodiments, a solid tissue sample is a fresh frozen tissue sample.

In some embodiments, a dedicated normal sample is collected from the patient, for co-processing with a liquid biopsy sample. Generally, the normal sample is of a non-cancerous tissue, and can be collected using any tissue collection means described above. In some embodiments, buccal cells collected from the inside of a patient's cheeks are used as a normal sample. Buccal cells can be collected by placing an absorbent material, e.g., a swab, in the subject's mouth and rubbing it against their cheek, e.g., for at least 15 seconds or for at least 30 seconds. The swab is then removed from the patient's mouth and inserted into a tube, such that the tip of the swab is submerged into a liquid that serves to extract the buccal cells off of the absorbent material. An example of buccal cell recovery and collection devices is provided in U.S. Pat. No. 9,138,205, the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, the buccal swab DNA is used as a source of normal DNA in circulating heme malignancies.

The biological samples collected from the patient are, optionally, sent to various analytical environments (e.g., sequencing lab 230, pathology lab 240, and/or molecular biology lab 250) for processing (e.g., data collection) and/or analysis (e.g., feature extraction). Wet lab processing 204 may include cataloguing samples (e.g., accessioning), examining clinical features of one or more samples (e.g., pathology review), and nucleic acid sequence analysis (e.g., extraction, library prep, capture+hybridize, pooling, and sequencing). In some embodiments, the workflow includes clinical analysis of one or more biological samples collected from the subject, e.g., at a pathology lab 240 and/or a molecular and cellular biology lab 250, to generate clinical features such as pathology features 128-1, imaging data 128-2, and/or tissue culture/organoid data 128-3.

In some embodiments, the pathology data 128-1 collected during clinicalevaluation includes visual features identified by a pathologist'sinspection of a specimen (e.g., a solid tumor biopsy), e.g., of stainedH&E or IHC slides. In some embodiments, the sample is a solid tissuebiopsy sample. In some embodiments, the tissue biopsy sample is aformalin-fixed tissue (FFT), e.g., a formalin-fixed paraffin-embedded(FFPE) tissue. In some embodiments, the tissue biopsy sample is an FFPEor FFT block. In some embodiments, the tissue biopsy sample is afresh-frozen tissue biopsy. The tissue biopsy sample can be prepared inthin sections (e.g., by cutting and/or affixing to a slide), tofacilitate pathology review (e.g., by staining with immunohistochemistrystain for IHC review and/or with hematoxylin and eosin stain for H&Epathology review). For instance, analysis of slides for H&E staining orIHC staining may reveal features such as tumor infiltration, programmeddeath-ligand 1 (PD-L1) status, human leukocyte antigen (HLA) status, orother immunological features.

In some embodiments, a liquid sample (e.g., blood) collected from thepatient (e.g., in EDTA-containing collection tubes) is prepared on aslide (e.g., by smearing) for pathology review. In some embodiments,macrodissected FFPE tissue sections, which may be mounted on ahistopathology slide, from solid tissue samples (e.g., tumor or normaltissue) are analyzed by pathologists. In some embodiments, tumor samplesare evaluated to determine, e.g., the tumor purity of the sample, thepercent tumor cellularity as a ratio of tumor to normal nuclei, etc. Foreach section, background tissue may be excluded or removed such that thesection meets a tumor purity threshold, e.g., where at least 20% of thenuclei in the section are tumor nuclei, or where at least 25%, 30%, 40%,50%, 60%, 70%, 80%, 90%, or more of the nuclei in the section are tumornuclei.
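The tumor purity check described above can be illustrated with a short sketch. It assumes that purity is computed as the fraction of tumor nuclei among all counted nuclei in a section, which is how the "at least 20% of the nuclei" threshold reads; the function names and the nuclei counts are hypothetical.

```python
def tumor_purity(tumor_nuclei, normal_nuclei):
    """Fraction of counted nuclei in a section that are tumor nuclei."""
    total = tumor_nuclei + normal_nuclei
    if total <= 0:
        raise ValueError("at least one nucleus must be counted")
    return tumor_nuclei / total


def meets_purity_threshold(tumor_nuclei, normal_nuclei, threshold=0.20):
    """True when the section satisfies the purity threshold (default 20%)."""
    return tumor_purity(tumor_nuclei, normal_nuclei) >= threshold


# Hypothetical counts for a section before and after excluding background tissue.
before = meets_purity_threshold(tumor_nuclei=10, normal_nuclei=90)
after = meets_purity_threshold(tumor_nuclei=10, normal_nuclei=30)
```

In this example the section fails the threshold at 10% purity and passes once enough background tissue is excluded to reach 25%.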

In some embodiments, pathology data 128-1 is extracted, in addition toor instead of visual inspection, using computational approaches todigital pathology, e.g., providing morphometric features extracted fromdigital images of stained tissue samples. A review of digital pathologymethods is provided in Bera, K. et al., Nat. Rev. Clin. Oncol.,16:703-15 (2019), the content of which is hereby incorporated byreference, in its entirety, for all purposes. In some embodiments,pathology data 128-1 includes features determined using machine learningalgorithms to evaluate pathology data collected as described above.

Further details on methods, systems, and algorithms for using pathology data to classify cancer and identify targeted therapies are discussed, for example, in U.S. patent application Ser. No. 16/830,186, filed on Mar. 25, 2020, and U.S. Provisional Application No. 63/007,874, filed on Apr. 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.

In some embodiments, imaging data 128-2 collected during clinicalevaluation includes features identified by review of in-vitro and/orin-vivo imaging results (e.g., of a tumor site), for example a size of atumor, tumor size differentials over time (such as during treatment orduring other periods of change). In some embodiments, imaging data 128-2includes features determined using machine learning algorithms toevaluate imaging data collected as described above.

Further details on methods, systems, and algorithms for using medical imaging to classify cancer and identify targeted therapies are discussed, for example, in U.S. patent application Ser. No. 16/830,186, filed on Mar. 25, 2020, and U.S. Provisional Application No. 63/007,874, filed on Apr. 9, 2020, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.

In some embodiments, tissue culture/organoid data 128-3 collected duringclinical evaluation includes features identified by evaluation ofcultured tissue from the subject. For instance, in some embodiments,tissue samples obtained from the patients (e.g., tumor tissue, normaltissue, or both) are cultured (e.g., in liquid culture, solid-phaseculture, and/or organoid culture) and various features, such as cellmorphology, growth characteristics, genomic alterations, and/or drugsensitivity, are evaluated. In some embodiments, tissue culture/organoiddata 128-3 includes features determined using machine learningalgorithms to evaluate tissue culture/organoid data collected asdescribed above. Examples of tissue organoid (e.g., personal tumororganoid) culturing and feature extractions thereof are described inU.S. Provisional Application Ser. No. 62/924,621, filed on Oct. 22,2019, and U.S. patent application Ser. No. 16/693,117, filed on Nov. 22,2019, the contents of which are hereby incorporated by reference, intheir entireties, for all purposes.

Nucleic acid sequencing of one or more samples collected from thesubject is performed, e.g., at sequencing lab 230, during wet labprocessing 204. An example workflow for nucleic acid sequencing isillustrated in FIG. 3. In some embodiments, the one or more biologicalsamples obtained at the sequencing lab 230 are accessioned (302), totrack the sample and data through the sequencing process.

Next, nucleic acids, e.g., RNA and/or DNA are extracted (304) from theone or more biological samples. Methods for isolating nucleic acids frombiological samples are known in the art, and are dependent upon the typeof nucleic acid being isolated (e.g., cfDNA, DNA, and/or RNA) and thetype of sample from which the nucleic acids are being isolated (e.g.,liquid biopsy samples, white blood cell buffy coat preparations,formalin-fixed paraffin-embedded (FFPE) solid tissue samples, and freshfrozen solid tissue samples). The selection of any particular nucleicacid isolation technique for use in conjunction with the embodimentsdescribed herein is well within the skill of the person having ordinaryskill in the art, who will consider the sample type, the state of thesample, the type of nucleic acid being sequenced and the sequencingtechnology being used.

For instance, many techniques for DNA isolation, e.g., genomic DNAisolation, from a tissue sample are known in the art, such as organicextraction, silica adsorption, and anion exchange chromatography.Likewise, many techniques for RNA isolation, e.g., mRNA isolation, froma tissue sample are known in the art. For example, acid guanidiniumthiocyanate-phenol-chloroform extraction (see, for example, Chomczynskiand Sacchi, 2006, Nat Protoc, 1(2):581-85, which is hereby incorporatedby reference herein), and silica bead/glass fiber adsorption (see, forexample, Poeckh, T. et al., 2008, Anal Biochem., 373(2):253-62, which ishereby incorporated by reference herein). The selection of anyparticular DNA or RNA isolation technique for use in conjunction withthe embodiments described herein is well within the skill of the personhaving ordinary skill in the art, who will consider the tissue type, thestate of the tissue, e.g., fresh, frozen, formalin-fixed,paraffin-embedded (FFPE), and the type of nucleic acid analysis that isto be performed.

In some embodiments where the biological sample is a liquid biopsy sample, e.g., a blood or blood plasma sample, cfDNA is isolated from blood samples using commercially available reagents, including proteinase K, to generate a liquid solution of cfDNA.

In some embodiments, isolated DNA molecules are mechanically sheared toan average length using an ultrasonicator (for example, a Covarisultrasonicator). In some embodiments, isolated nucleic acid moleculesare analyzed to determine their fragment size, e.g., through gelelectrophoresis techniques and/or the use of a device such as a LabChipGX Touch. The skilled artisan will know of an appropriate range offragment sizes, based on the sequencing technique being employed, asdifferent sequencing techniques have differing fragment sizerequirements for robust sequencing. In some embodiments, quality controltesting is performed on the extracted nucleic acids (e.g., DNA and/orRNA), e.g., to assess the nucleic acid concentration and/or fragmentsize. For example, sizing of DNA fragments provides valuable informationused for downstream processing, such as determining whether DNAfragments require additional shearing prior to sequencing.

Wet lab processing 204 then includes preparing a nucleic acid libraryfrom the isolated nucleic acids (e.g., cfDNA, DNA, and/or RNA). Forexample, in some embodiments, DNA libraries (e.g., gDNA and/or cfDNAlibraries) are prepared from isolated DNA from the one or morebiological samples. In some embodiments, the DNA libraries are preparedusing a commercial library preparation kit, e.g., the KAPA Hyper PrepKit, a New England Biolabs (NEB) kit, or a similar kit.

In some embodiments, during library preparation, adapters (e.g., UDI adapters, such as Roche SeqCap dual end adapters, or UMI adapters such as full length or stubby Y adapters) are ligated onto the nucleic acid molecules. In some embodiments, the adapters include unique molecular identifiers (UMIs), which are short nucleic acid sequences (e.g., 3-10 base pairs) that are added to ends of DNA fragments during adapter ligation. In some embodiments, UMIs are degenerate base pairs that serve as a unique tag that can be used to identify sequence reads originating from a specific DNA fragment. In some embodiments, e.g., when multiplex sequencing will be used to sequence DNA from a plurality of samples (e.g., from the same or different subjects) in a single sequencing reaction, a patient-specific index is also added to the nucleic acid molecules. In some embodiments, the patient-specific index is a short nucleic acid sequence (e.g., 3-20 nucleotides) that is added to ends of DNA fragments during library construction and that serves as a unique tag that can be used to identify sequence reads originating from a specific patient sample. Examples of identifier sequences are described, for example, in Kivioja et al., Nat. Methods 9(1):72-74 (2011) and Islam et al., Nat. Methods 11(2):163-66 (2014), the contents of which are hereby incorporated by reference, in their entireties, for all purposes.

In some embodiments, an adapter includes a PCR primer landing site,designed for efficient binding of a PCR or second-strand synthesisprimer used during the sequencing reaction. In some embodiments, anadapter includes an anchor binding site, to facilitate binding of theDNA molecule to anchor oligonucleotide molecules on a sequencer flowcell, serving as a seed for the sequencing process by providing astarting point for the sequencing reaction. During PCR amplificationfollowing adapter ligation, the UMIs, patient indexes, and binding sitesare replicated along with the attached DNA fragment. This provides a wayto identify sequence reads that came from the same original fragment indownstream analysis.
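Because the UMI is replicated with its fragment during amplification, reads sharing a UMI and an alignment position can be attributed to the same original fragment. The sketch below illustrates that grouping under simplifying assumptions: reads are hypothetical dictionaries with `umi`, `chrom`, and `start` keys, UMIs are compared by exact match (no sequencing-error tolerance), and the function name is illustrative rather than part of the disclosed system.

```python
from collections import defaultdict


def group_reads_by_fragment(reads):
    """Group reads that share a UMI and an alignment start position.

    Each group approximates the set of PCR copies of one original fragment.
    """
    families = defaultdict(list)
    for read in reads:
        key = (read["umi"], read["chrom"], read["start"])
        families[key].append(read)
    return dict(families)


# Three hypothetical reads: the first two are PCR copies of one fragment.
reads = [
    {"umi": "ACGT", "chrom": "chr7", "start": 1000, "seq": "TTGACC"},
    {"umi": "ACGT", "chrom": "chr7", "start": 1000, "seq": "TTGACC"},
    {"umi": "GGCA", "chrom": "chr7", "start": 1000, "seq": "TTGATC"},
]
families = group_reads_by_fragment(reads)
fragment_count = len(families)  # number of distinct original fragments
```

Counting groups rather than raw reads gives a fragment-level count, so amplification duplicates are not counted more than once.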

In some embodiments, DNA libraries are amplified and purified usingcommercial reagents, (e.g., Axygen MAG PCR clean up beads). In some suchembodiments, the concentration and/or quantity of the DNA molecules arethen quantified using a fluorescent dye and a fluorescence microplatereader, standard spectrofluorometer, or filter fluorometer. In someembodiments, library amplification is performed on a device (e.g., anIllumina C-Bot2) and the resulting flow cell containing amplifiedtarget-captured DNA libraries is sequenced on a next generationsequencer (e.g., an Illumina HiSeq 4000 or an Illumina NovaSeq 6000) toa unique on-target depth selected by the user. In some embodiments, DNAlibrary preparation is performed with an automated system, using aliquid handling robot (e.g., a SciClone NGSx).

In some embodiments, where feature data 125 includes methylation states 132 for one or more genomic locations, nucleic acids isolated from the biological sample (e.g., cfDNA) are treated to convert unmethylated cytosines to uracils, e.g., prior to generating the sequencing library. Accordingly, when the nucleic acids are sequenced, all cytosines called in the sequencing reaction were necessarily methylated, since the unmethylated cytosines were converted to uracils and accordingly would have been called as thymidines, rather than cytosines, in the sequencing reaction. Commercial kits are available for bisulfite-mediated conversion of unmethylated cytosines to uracils, for instance, the EZ DNA Methylation™-Gold, EZ DNA Methylation™-Direct, and EZ DNA Methylation™-Lightning kit (available from Zymo Research Corp (Irvine, Calif.)). Commercial kits are also available for enzymatic conversion of unmethylated cytosines to uracils, for example, the APOBEC-Seq kit (available from NEBiolabs, Ipswich, Mass.).
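The inference described above, in which a cytosine that survives conversion is read as methylated and a cytosine read as thymine is read as unmethylated, can be sketched as follows. This is a simplified illustration that assumes a read aligned to the same strand as the reference with no insertions or deletions; the function name and the example sequences are hypothetical.

```python
def call_methylation(reference, read):
    """Classify each reference cytosine from a conversion-treated read.

    A 'C' in the read means the cytosine resisted conversion (methylated);
    a 'T' means it was converted to uracil and sequenced as thymine
    (unmethylated). Any other base is reported as 'other'.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must be the same length")
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue  # only reference cytosines are informative here
        if read_base == "C":
            calls[pos] = "methylated"
        elif read_base == "T":
            calls[pos] = "unmethylated"
        else:
            calls[pos] = "other"
    return calls


# Reference cytosines at positions 1, 4, and 5; position 4 was converted.
calls = call_methylation("ACGTCC", "ACGTTC")
```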

In some embodiments, wet lab processing 204 includes pooling (308) DNA molecules from a plurality of libraries, corresponding to different samples from the same and/or different patients, to form a sequencing pool of DNA libraries. When the pool of DNA libraries is sequenced, the resulting sequence reads correspond to nucleic acids isolated from multiple samples. The sequence reads can be separated into different sequence read files, corresponding to the various samples represented in the sequencing reaction, based on the unique identifiers present in the added nucleic acid fragments. In this fashion, a single sequencing reaction can generate sequence reads from multiple samples. Advantageously, this allows for the processing of more samples per sequencing reaction.
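Separating pooled reads by sample can be sketched as a look-up of each read's index sequence. The example below is a minimal illustration under stated assumptions: reads are hypothetical `(index, sequence)` pairs, indexes are matched exactly, and unrecognized indexes are collected under an "undetermined" label chosen here for illustration.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Split pooled reads into per-sample lists using their index sequence."""
    by_sample = defaultdict(list)
    for index, sequence in reads:
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Hypothetical index assignments and pooled reads.
index_to_sample = {"AACCGG": "sample_A", "TTGGCC": "sample_B"}
pooled_reads = [
    ("AACCGG", "ACGTACGT"),
    ("TTGGCC", "GGGTTTAA"),
    ("AACCGG", "TTTTCCCC"),
    ("NNNNNN", "ACACACAC"),
]
per_sample = demultiplex(pooled_reads, index_to_sample)
```

Each per-sample list could then be written to its own sequence read file.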

In some embodiments, wet lab processing 204 includes enriching (310) asequencing library, or pool of sequencing libraries, for target nucleicacids, e.g., nucleic acids encompassing loci that are informative forprecision oncology and/or used as internal controls for the sequencingor bioinformatics processes. In some embodiments, enrichment is achievedby hybridizing target nucleic acids in the sequencing library to probesthat hybridize to the target sequences, and then isolating the capturednucleic acids away from off-target nucleic acids that are not bound bythe capture probes.

Advantageously, enriching for target sequences prior to sequencingnucleic acids significantly reduces the costs and time associated withsequencing, facilitates multiplex sequencing by allowing multiplesamples to be mixed together for a single sequencing reaction, andsignificantly reduces the computation burden of aligning the resultingsequence reads, as a result of significantly reducing the total amountof nucleic acids analyzed from each sample.

In some embodiments, the enrichment is performed prior to poolingmultiple nucleic acid sequencing libraries. However, in otherembodiments, the enrichment is performed after pooling nucleic acidsequencing libraries, which has the advantage of reducing the number ofenrichment assays that have to be performed.

In some embodiments, the enrichment is performed prior to generating anucleic acid sequencing library. This has the advantage that fewerreagents are needed to perform both the enrichment (because there arefewer target sequences at this point, prior to library amplification)and the library production (because there are fewer nucleic acidmolecules to tag and amplify after the enrichment). However, this raisesthe possibility of pull-down bias and/or that small variations in theenrichment protocol will result in less consistent results.

In some embodiments, nucleic acid libraries are pooled (two or more DNA libraries may be mixed to create a pool) and treated with reagents to reduce off-target capture, for example Human COT-1 and/or IDT xGen Universal Blockers. Pools may be dried in a vacufuge and resuspended. DNA libraries or pools may be hybridized to a probe set (for example, a probe set specific to a panel that includes loci from at least 100, 600, 1,000, 10,000, etc. of the 19,000 known human genes) and amplified with commercially available reagents (for example, the KAPA HiFi HotStart ReadyMix). For example, in some embodiments, a pool is incubated in an incubator, PCR machine, water bath, or other temperature-modulating device to allow probes to hybridize. Pools may then be mixed with Streptavidin-coated beads or another means for capturing hybridized DNA-probe molecules, such as DNA molecules representing exons of the human genome and/or genes selected for a genetic panel.

Pools may be amplified and purified more than once using commercially available reagents, for example, the KAPA HiFi Library Amplification kit and Axygen MAG PCR clean up beads, respectively. The pools or DNA libraries may be analyzed to determine the concentration or quantity of DNA molecules, for example by using a fluorescent dye (for example, PicoGreen pool quantification) and a fluorescence microplate reader, standard spectrofluorometer, or filter fluorometer. In one example, the DNA library preparation and/or capture is performed with an automated system, using a liquid handling robot (for example, a SciClone NGSx).

In some embodiments, e.g., where a whole genome sequencing method will be used, nucleic acid sequencing libraries are not target-enriched prior to sequencing, in order to obtain sequencing data on substantially all of the competent nucleic acids in the sequencing library. Similarly, in some embodiments, e.g., where a whole genome sequencing method will be used, nucleic acid sequencing libraries are not mixed, because of bandwidth limitations related to obtaining significant sequencing depth across an entire genome. However, in other embodiments, e.g., where a low pass whole genome sequencing (LPWGS) methodology will be used, nucleic acid sequencing libraries can still be pooled, because very low average sequencing coverage is achieved across a respective genome, e.g., between about 0.5× and about 5×.

In some embodiments, a plurality of nucleic acid probes (e.g., a probe set) is used to enrich one or more target sequences in a nucleic acid sample (e.g., an isolated nucleic acid sample or a nucleic acid sequencing library), e.g., where one or more target sequences is informative for precision oncology. For instance, in some embodiments, one or more of the target sequences encompasses a locus that is associated with an actionable allele. That is, variations of the target sequence are associated with targeted therapeutic approaches. In some embodiments, one or more of the target sequences and/or a property of one or more of the target sequences is used in a classifier trained to distinguish two or more cancer states.

In some embodiments, the probe set includes probes targeting one or more gene loci, e.g., exon or intron loci. In some embodiments, the probe set includes probes targeting one or more loci not encoding a protein, e.g., regulatory loci, miRNA loci, and other non-coding loci, e.g., that have been found to be associated with cancer. In some embodiments, the plurality of loci includes at least 25, 50, 100, 150, 200, 250, 300, 350, 400, 500, 750, 1000, 2500, 5000, or more human genomic loci.

In some embodiments, the probe set includes probes targeting one or more of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 5 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 10 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 25 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 50 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 75 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting at least 100 of the genes listed in Table 1. In some embodiments, the probe set includes probes targeting all of the genes listed in Table 1.

TABLE 1. An example panel of 105 genes that are informative for precision oncology.
Example Gene Panel for Precision Oncology
ALK       B2M       ERRFI1    IDH2      MSH6      PIK3R1    SPOP
FGFR2     BAP1      ESR1      JAK1      MTOR      PMS2      STK11
FGFR3     BRCA1     EZH2      JAK2      MYCN      PTCH1     TERT
NTRK1     BRCA2     FBXW7     JAK3      NF1       PTEN      TP53
RET       BTK       FGFR1     KDR       NF2       PTPN11    TSC1
ROS1      CCND1     FGFR4     KEAP1     NFE2L2    RAD51C    TSC2
BRAF      CCND2     FLT3      KIT       NOTCH1    RAF1      UGT1A1
AKT1      CCND3     FOXL2     KRAS      NPM1      RB1       VHL
AKT2      CDH1      GATA3     MAP2K1    NRAS      RHEB      CCNE1
APC       CDK4      GNA11     MAP2K2    PALB2     RHOA      CD274
AR        CDK6      GNAQ      MAPK1     PBRM1     RIT1      EGFR
ARAF      CDKN2A    GNAS      MLH1      PDCD1LG2  RNF43     ERBB2
ARID1A    CTNNB1    HNF1A     MPL       PDGFRA    SDHA      MET
ATM       DDR2      HRAS      MSH2      PDGFRB    SMAD4     MYC
ATR       DPYD      IDH1      MSH3      PIK3CA    SMO       KMT2A

In some embodiments, the probe set includes probes targeting one or more of the genes listed in List 1, provided below. In some embodiments, the probe set includes probes targeting at least 5 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 10 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 25 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 50 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting at least 70 of the genes listed in List 1. In some embodiments, the probe set includes probes targeting all of the genes listed in List 1.

In some embodiments, the probe set includes probes targeting one or more of the genes listed in List 2, provided below. In some embodiments, the probe set includes probes targeting at least 5 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 10 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 25 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 50 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 75 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting at least 100 of the genes listed in List 2. In some embodiments, the probe set includes probes targeting all of the genes listed in List 2.

In some embodiments, panels of genes including one or more genes from the following lists are used for analyzing specimens, sequencing, and/or identification. In some embodiments, panels of genes for analyzing specimens, sequencing, and/or identification include one or more genes from List 1 or List 2. In some embodiments, panels of genes for analyzing specimens, sequencing, and/or identification include one or more genes from:

List 1: AKT1 (14q32.33), ALK (2p23.2-23.1), APC (5q22.2), AR (Xq12), ARAF (Xp11.3), ARID1A (1p36.11), ATM (11q22.3), BRAF (7q34), BRCA1 (17q21.31), BRCA2 (13q13.1), CCND1 (11q13.3), CCND2 (12p13.32), CCNE1 (19q12), CDH1 (16q22.1), CDK4 (12q14.1), CDK6 (7q21.2), CDKN2A (9p21.3), CTNNB1 (3p22.1), DDR2 (1q23.3), EGFR (7p11.2), ERBB2 (17q12), ESR1 (6q25.1-25.2), EZH2 (7q36.1), FBXW7 (4q31.3), FGFR1 (8p11.23), FGFR2 (10q26.13), FGFR3 (4p16.3), GATA3 (10p14), GNA11 (19p13.3), GNAQ (9q21.2), GNAS (20q13.32), HNF1A (12q24.31), HRAS (11p15.5), IDH1 (2q34), IDH2 (15q26.1), JAK2 (9p24.1), JAK3 (19p13.11), KIT (4q12), KRAS (12p12.1), MAP2K1 (15q22.31), MAP2K2 (19p13.3), MAPK1 (22q11.22), MAPK3 (16p11.2), MET (7q31.2), MLH1 (3p22.2), MPL (1p34.2), MTOR (1p36.22), MYC (8q24.21), NF1 (17q11.2), NFE2L2 (2q31.2), NOTCH1 (9q34.3), NPM1 (5q35.1), NRAS (1p13.2), NTRK1 (1q23.1), NTRK3 (15q25.3), PDGFRA (4q12), PIK3CA (3q26.32), PTEN (10q23.31), PTPN11 (12q24.13), RAF1 (3p25.2), RB1 (13q14.2), RET (10q11.21), RHEB (7q36.1), RHOA (3p21.31), RIT1 (1q22), ROS1 (6q22.1), SMAD4 (18q21.2), SMO (7q32.1), STK11 (19p13.3), TERT (5p15.33), TP53 (17p13.1), TSC1 (9q34.13), and VHL (3p25.3).

List 2: ABL1, ACVR1B, AKT1, AKT2, AKT3, ALK, ALOX12B, AMER1 (FAM123B), APC, AR, ARAF, ARFRP1, ARID1A, ASXL1, ATM, ATR, ATRX, AURKA, AURKB, AXIN1, AXL, BAP1, BARD1, BCL2, BCL2L1, BCL2L2, BCL6, BCOR, BCORL1, BRAF, BRCA1, BRCA2, BRD4, BRIP1, BTG1, BTG2, BTK, C11orf30 (EMSY), C17orf39 (GID4), CALR, CARD11, CASP8, CBFB, CBL, CCND1, CCND2, CCND3, CCNE1, CD22, CD274 (PD-L1), CD70, CD79A, CD79B, CDC73, CDH1, CDK12, CDK4, CDK6, CDK8, CDKN1A, CDKN1B, CDKN2A, CDKN2B, CDKN2C, CEBPA, CHEK1, CHEK2, CIC, CREBBP, CRKL, CSF1R, CSF3R, CTCF, CTNNA1, CTNNB1, CUL3, CUL4A, CXCR4, CYP17A1, DAXX, DDR1, DDR2, DIS3, DNMT3A, DOT1L, EED, EGFR, EP300, EPHA3, EPHB1, EPHB4, ERBB2, ERBB3, ERBB4, ERCC4, ERG, ERRFI1, ESR1, EZH2, FAM46C, FANCA, FANCC, FANCG, FANCL, FAS, FBXW7, FGF10, FGF12, FGF14, FGF19, FGF23, FGF3, FGF4, FGF6, FGFR1, FGFR2, FGFR3, FGFR4, FH, FLCN, FLT1, FLT3, FOXL2, FUBP1, GABRA6, GATA3, GATA4, GATA6, GNA11, GNA13, GNAQ, GNAS, GRM3, GSK3B, H3F3A, HDAC1, HGF, HNF1A, HRAS, HSD3B1, ID3, IDH1, IDH2, IGF1R, IKBKE, IKZF1, INPP4B, IRF2, IRF4, IRS2, JAK1, JAK2, JAK3, JUN, KDM5A, KDM5C, KDM6A, KDR, KEAP1, KEL, KIT, KLHL6, KMT2A, KMT2D (MLL2), KRAS, LTK, LYN, MAF, MAP2K1 (MEK1), MAP2K2 (MEK2), MAP2K4, MAP3K1, MAP3K13, MAPK1, MCL1, MDM2, MDM4, MED12, MEF2B, MEN1, MERTK, MET, MITF, MKNK1, MLH1, MPL, MRE11A, MSH2, MSH3, MSH6, MST1R, MTAP, MTOR, MUTYH, MYC, MYCL (MYCL1), MYCN, MYD88, NBN, NF1, NF2, NFE2L2, NFKBIA, NKX2-1, NOTCH1, NOTCH2, NOTCH3, NPM1, NRAS, NSD3 (WHSC1L1), NT5C2, NTRK1, NTRK2, NTRK3, P2RY8, PALB2, PARK2, PARP1, PARP2, PARP3, PAX5, PBRM1, PDCD1 (PD-1), PDCD1LG2 (PD-L2), PDGFRA, PDGFRB, PDK1, PIK3C2B, PIK3C2G, PIK3CA, PIK3CB, PIK3R1, PIM1, PMS2, POLD1, POLE, PPARG, PPP2R1A, PPP2R2A, PRDM1, PRKAR1A, PRKCI, PTCH1, PTEN, PTPN11, PTPRO, QKI, RAC1, RAD21, RAD51, RAD51B, RAD51C, RAD51D, RAD52, RAD54L, RAF1, RARA, RB1, RBM10, REL, RET, RICTOR, RNF43, ROS1, RPTOR, SDHA, SDHB, SDHC, SDHD, SETD2, SF3B1, SGK1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SNCAIP, SOCS1, SOX2, SOX9, SPEN, SPOP, SRC, STAG2, STAT3, STK11, SUFU, SYK, TBX3, TEK, TERC, TERT, TET2, ncRNA, Promoter, TGFBR2, TIPARP, TNFAIP3, TNFRSF14, TP53, TSC1, TSC2, TYRO3, U2AF1, VEGFA, VHL, WHSC1, WT1, XPO1, XRCC2, ZNF217, and ZNF703.

Generally, probes for enrichment of nucleic acids (e.g., cfDNA obtained from a liquid biopsy sample) include DNA, RNA, or a modified nucleic acid structure with a base sequence that is complementary to a locus of interest. For instance, a probe designed to hybridize to a locus in a cfDNA molecule can contain a sequence that is complementary to either strand, because the cfDNA molecules are double stranded. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 10, at least 11, at least 12, at least 13, at least 14, or at least 15 consecutive bases of a locus of interest. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 20, 25, 30, 40, 50, 75, 100, 150, 200, or more consecutive bases of a locus of interest.

Targeted panels provide several benefits for nucleic acid sequencing. For example, in some embodiments, algorithms for discriminating between, e.g., a first and second cancer condition can be trained on smaller, more informative data sets (e.g., fewer genes), which leads to more computationally efficient training of classifiers that discriminate between the first and second cancer states. Such improvements in computational efficiency, owing to the reduced size of the discriminating gene set, can advantageously either be used to speed up classifier training or be used to improve the performance of such classifiers (e.g., through more extensive training of the classifier).

In some embodiments, the gene panel is a whole-exome panel that analyzes the exomes of a biological sample. In some embodiments, the gene panel is a whole-genome panel that analyzes the genome of a specimen. In some embodiments, the gene panel is optimized for use with liquid biopsy samples (e.g., to provide clinical decision support for solid tumors). See, for example, Table 1 above.

In some embodiments, the probes include additional nucleic acid sequences that do not share any homology to the loci of interest. For example, in some embodiments, the probes also include nucleic acid sequences containing an identifier sequence, e.g., a unique molecular identifier (UMI), e.g., that is unique to a particular sample or subject. Examples of identifier sequences are described, for example, in Kivioja et al., 2011, Nat. Methods 9(1), pp. 72-74 and Islam et al., 2014, Nat. Methods 11(2), pp. 163-66, which are incorporated by reference herein. Similarly, in some embodiments, the probes also include primer nucleic acid sequences useful for amplifying the nucleic acid molecule of interest, e.g., using PCR. In some embodiments, the probes also include a capture sequence designed to hybridize to an anti-capture sequence for recovering the nucleic acid molecule of interest from the sample.

Likewise, in some embodiments, the probes each include a non-nucleic acid affinity moiety covalently attached to a nucleic acid molecule that is complementary to the loci of interest, for recovering the nucleic acid molecule of interest. Non-limiting examples of non-nucleic acid affinity moieties include biotin, digoxigenin, and dinitrophenol. In some embodiments, the probe is attached to a solid-state surface or particle, e.g., a dipstick or magnetic bead, for recovering the nucleic acid of interest. In some embodiments, the methods described herein include amplifying the nucleic acids that bound to the probe set prior to further analysis, e.g., sequencing. Methods for amplifying nucleic acids, e.g., by PCR, are well known in the art.

Sequence reads are then generated (312) from the sequencing library or pool of sequencing libraries. Sequencing data may be acquired by any methodology known in the art. For example, next generation sequencing (NGS) techniques may be used, such as sequencing-by-synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing. In some embodiments, massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators. In some embodiments, sequencing is performed using next generation sequencing technologies, such as short-read technologies. In other embodiments, long-read sequencing or another sequencing method known in the art is used.

Next-generation sequencing produces millions of short reads (e.g., sequence reads) for each biological sample. Accordingly, in some embodiments, the plurality of sequence reads obtained by next-generation sequencing of cfDNA molecules are DNA sequence reads. In some embodiments, the sequence reads have an average length of at least fifty nucleotides. In other embodiments, the sequence reads have an average length of at least 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, or more nucleotides.

In some embodiments, sequencing is performed after enriching for nucleic acids (e.g., cfDNA, gDNA, and/or RNA) encompassing a plurality of predetermined target sequences, e.g., human genes and/or non-coding sequences associated with cancer. Advantageously, sequencing a nucleic acid sample that has been enriched for target nucleic acids, rather than all nucleic acids isolated from a biological sample, significantly reduces the average time and cost of the sequencing reaction. Accordingly, in some embodiments, the methods described herein include obtaining a plurality of sequence reads of nucleic acids that have been hybridized to a probe set for hybrid-capture enrichment (e.g., of one or more genes listed in Table 1).

In some embodiments, panel-targeting sequencing is performed to an average on-target depth of at least 500×, at least 750×, at least 1000×, at least 2500×, at least 5000×, at least 10,000×, or greater depth. In some embodiments, samples are further assessed for uniformity above a sequencing depth threshold (e.g., 95% of all targeted base pairs at 300× sequencing depth). In some embodiments, the sequencing depth threshold is a minimum depth selected by a user or practitioner.
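By way of illustration only, the uniformity assessment described above can be sketched as follows. The function names, the per-base depth list, and the default values (300× and 95%, taken from the example in the text) are assumptions made for this sketch and do not describe a required implementation.

```python
from typing import Sequence


def fraction_at_or_above(depths: Sequence[int], min_depth: int) -> float:
    """Fraction of targeted bases whose depth is at least min_depth."""
    if not depths:
        raise ValueError("at least one targeted base is required")
    return sum(d >= min_depth for d in depths) / len(depths)


def passes_uniformity(
    depths: Sequence[int],
    min_depth: int = 300,        # example depth threshold from the text
    min_fraction: float = 0.95,  # example fraction of targeted bases
) -> bool:
    """True when enough targeted bases reach the depth threshold."""
    return fraction_at_or_above(depths, min_depth) >= min_fraction


# Hypothetical per-base depths for a 100-base target region.
example_depths = [500] * 96 + [100] * 4
sample_is_uniform = passes_uniformity(example_depths)
```

In this hypothetical example, 96 of 100 targeted bases reach 300×, so the sample satisfies the 95% criterion.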

In some embodiments, the sequence reads are obtained by a whole genome or whole exome sequencing methodology. In some such embodiments, whole exome capture is performed with an automated system, using a liquid handling robot (for example, a SciClone NGSx). Whole genome sequencing, and to some extent whole exome sequencing, is typically performed at lower sequencing depth than smaller target-panel sequencing reactions, because many more loci are being sequenced. For example, in some embodiments, whole genome or whole exome sequencing is performed to an average sequencing depth of at least 3×, at least 5×, at least 10×, at least 15×, at least 20×, or greater. In some embodiments, low-pass whole genome sequencing (LPWGS) techniques are used for whole genome or whole exome sequencing. LPWGS is typically performed to an average sequencing depth of about 0.25× to about 5×, more typically to an average sequencing depth of about 0.5× to about 3×.

Because of the differences in the sequencing methodologies, data obtained from targeted-panel sequencing is better suited for certain analyses than data obtained from whole genome/whole exome sequencing, and vice versa. For instance, because of the higher sequencing depth achieved by targeted-panel sequencing, the resulting sequence data is better suited for the identification of variant alleles present at low allelic fractions in the sample, e.g., less than 20%. By contrast, data generated from whole genome/whole exome sequencing is better suited for the estimation of genome-wide metrics, such as tumor mutational burden, because the entire genome is better represented in the sequencing data. Accordingly, in some embodiments, a nucleic acid sample, e.g., a cfDNA, gDNA, or mRNA sample, is evaluated using both targeted-panel sequencing and whole genome/whole exome sequencing (e.g., LPWGS).
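The relationship between sequencing depth and the detectability of low-fraction variants can be illustrated with a simple binomial model. The sketch below is an assumption-laden simplification: it treats each fragment at a locus as an independent draw, ignores sequencing error, and uses a hypothetical minimum count of supporting fragments that is not taken from the disclosure.

```python
from math import comb


def expected_variant_fragments(depth: int, vaf: float) -> float:
    """Expected number of variant-supporting fragments at a locus."""
    return depth * vaf


def prob_at_least(depth: int, vaf: float, min_variant_fragments: int) -> float:
    """P(X >= min_variant_fragments) for X ~ Binomial(depth, vaf)."""
    below = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i)
        for i in range(min_variant_fragments)
    )
    return 1.0 - below


# A variant at 0.5% allele fraction, sampled at panel depth and at WGS depth.
panel_expected = expected_variant_fragments(5000, 0.005)  # about 25 fragments
wgs_expected = expected_variant_fragments(20, 0.005)      # about 0.1 fragments
panel_prob = prob_at_least(5000, 0.005, 5)  # hypothetical minimum of 5
wgs_prob = prob_at_least(20, 0.005, 1)      # chance of seeing even 1
```

Under these assumptions, a 0.5% variant is expected to be supported by about 25 fragments at 5000× depth but by about 0.1 fragments at 20× depth, which is why the higher depth of targeted-panel sequencing favors detection of low allelic fractions.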

In some embodiments, the raw sequence reads resulting from the sequencing reaction are output from the sequencer in a native file format, e.g., a BCL file. In some embodiments, the native file is passed directly to a bioinformatics pipeline (e.g., variant analysis 206), components of which are described in detail below. In other embodiments, pre-processing is performed prior to passing the sequences to the bioinformatics platform. For instance, in some embodiments, the format of the sequence read file is converted from the native file format (e.g., BCL) to a file format compatible with one or more algorithms used in the bioinformatics pipeline (e.g., FASTQ or FASTA). In some embodiments, the raw sequence reads are filtered to remove sequences that do not meet one or more quality thresholds. In some embodiments, raw sequence reads generated from the same unique nucleic acid molecule in the sequencing read are collapsed into a single sequence read representing the molecule, e.g., using UMIs as described above. In some embodiments, one or more of these pre-processing activities is performed within the bioinformatics pipeline itself.

In one example, a sequencer may generate a BCL file. A BCL file may include raw image data of a plurality of patient specimens which are sequenced. BCL image data is an image of the flow cell across each cycle during sequencing. A cycle may be implemented by illuminating a patient specimen with a specific wavelength of electromagnetic radiation, generating a plurality of images which may be processed into base calls via BCL to FASTQ processing algorithms which identify which base pairs are present at each cycle. The resulting FASTQ file includes the entirety of reads for each patient specimen paired with a quality metric, e.g., in a range from 0 to 64 where a 64 is the best quality and a 0 is the worst quality. In embodiments where both a liquid biopsy sample and a normal tissue sample are sequenced, sequence reads in the corresponding FASTQ files may be matched, such that a liquid biopsy-normal analysis may be performed.

FASTQ format is a text-based format for storing both a biological sequence, such as a nucleotide sequence, and its corresponding quality scores. These FASTQ files are analyzed to determine what genetic variants or copy number changes are present in the sample. Each FASTQ file contains reads that may be paired-end or single reads, and may be short-reads or long-reads, where each read represents one detected sequence of nucleotides in a nucleic acid molecule that was isolated from the patient sample or a copy of the nucleic acid molecule, detected by the sequencer. Each read in the FASTQ file is also associated with a quality rating. The quality rating may reflect the likelihood that an error occurred during the sequencing procedure that affected the associated read. In some embodiments, the results of paired-end sequencing of each isolated nucleic acid sample are contained in a split pair of FASTQ files, for efficiency. Thus, in some embodiments, forward (Read 1) and reverse (Read 2) sequences of each isolated nucleic acid sample are stored separately but in the same order and under the same identifier.
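As a non-limiting illustration of the split-pair FASTQ arrangement described above, the following sketch reads four-line FASTQ records and walks a Read 1 file and a Read 2 file in lockstep, checking that mates share an identifier. The record type, the function names, and the use of Phred+33 quality encoding are assumptions for this sketch; the text above does not specify an encoding.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO, Tuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line beginning with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record")
        name = header[1:].split()[0]  # identifier up to first whitespace
        # Assumed Phred+33 encoding: ASCII code minus 33 gives the score.
        yield FastqRecord(name, sequence, [ord(c) - 33 for c in quality])


def paired_reads(
    read1: TextIO, read2: TextIO
) -> Iterator[Tuple[FastqRecord, FastqRecord]]:
    """Pair Read 1 and Read 2 records stored in the same order."""
    for forward, reverse in zip(read_fastq(read1), read_fastq(read2)):
        if forward.name != reverse.name:
            raise ValueError("Read 1 and Read 2 files are out of order")
        yield forward, reverse


# Hypothetical in-memory files standing in for a split pair of FASTQ files.
r1 = io.StringIO("@frag1 1:N:0\nACGT\n+\nIII!\n@frag2 1:N:0\nGGCC\n+\nIIII\n")
r2 = io.StringIO("@frag1 2:N:0\nTTGA\n+\nIIII\n@frag2 2:N:0\nCCAA\n+\nIIII\n")
pairs = list(paired_reads(r1, r2))
```

Because the two files are stored in the same order and under the same identifier, pairing requires only a single pass over both files and no lookup table.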

In various embodiments, the bioinformatics pipeline may filter FASTQ data from the corresponding sequence data file for each respective biological sample. Such filtering may include correcting or masking sequencer errors and removing (trimming) low quality sequences or bases, adapter sequences, contaminations, chimeric reads, overrepresented sequences, biases caused by library preparation, amplification, or capture, and other errors.

While workflow 200 illustrates obtaining a biological sample, extracting nucleic acids from the biological sample, and sequencing the isolated nucleic acids, in some embodiments, sequencing data used in the improved systems and methods described herein (e.g., which include improved methods for validating a somatic sequence variant in a test subject having a cancer condition) is obtained by receiving previously generated sequence reads, in electronic form.

Referring again to FIG. 2A, nucleic acid sequencing data 122 generated from the one or more patient samples is then evaluated (e.g., via variant analysis 206) in a bioinformatics pipeline, e.g., using bioinformatics module 140 of system 100, to identify genomic alterations and other metrics in the cancer genome of the patient. An example overview for a bioinformatics pipeline is described below with respect to FIGS. 4A-4F. Advantageously, in some embodiments, the present disclosure improves bioinformatics pipelines, like pipeline 206, by improving methods and systems of validating somatic sequence variants.

FIG. 4A illustrates an example bioinformatics pipeline 206 (e.g., as used for feature extraction in the workflows illustrated in FIGS. 2A and 3) for providing clinical support for precision oncology. As shown in FIG. 4A, sequencing data 122 obtained from the wet lab processing 204 (e.g., sequence reads 314) is input into the pipeline.

In various embodiments, the bioinformatics pipeline includes a circulating tumor DNA (ctDNA) pipeline for analyzing liquid biopsy samples. The pipeline may detect SNVs, INDELs, copy number amplifications/deletions and genomic rearrangements (for example, fusions). The pipeline may employ unique molecular index (UMI)-based consensus base calling as a method of error suppression as well as a Bayesian tri-nucleotide context-based position level error suppression. In various embodiments, the pipeline is able to detect variants having a 0.1%, 0.15%, 0.2%, 0.25%, 0.3%, 0.4%, or 0.5% variant allele fraction.

In some embodiments, the sequencing data is processed (e.g., using sequence data processing module 141) to prepare it for genomic feature identification 385. For instance, in some embodiments as described above, the sequencing data is present in a native file format provided by the sequencer. Accordingly, in some embodiments, the system (e.g., system 100) applies a pre-processing algorithm 142 to convert the file format (318) to one that is recognized by one or more downstream processing algorithms. For example, BCL file outputs from a sequencer can be converted to a FASTQ file format using the bcl2fastq or bcl2fastq2 conversion software (Illumina®). FASTQ format is a text-based format for storing both a biological sequence, such as nucleotide sequence, and its corresponding quality scores. These FASTQ files are analyzed to determine what genetic variants, copy number changes, etc., are present in the sample.

In some embodiments, other preprocessing functions are performed, e.g., filtering sequence reads 122 based on a desired quality, e.g., size and/or quality of the base calling. In some embodiments, quality control checks are performed to ensure the data is sufficient for variant calling. For instance, entire reads, individual nucleotides, or multiple nucleotides that are likely to have errors may be discarded based on the quality rating associated with the read in the FASTQ file, the known error rate of the sequencer, and/or a comparison between each nucleotide in the read and one or more nucleotides in other reads that have been aligned to the same location in the reference genome. Filtering may be done in part or in its entirety by various software tools, for example, a software tool such as Skewer. See, Jiang, H. et al., BMC Bioinformatics 15(182):1-12 (2014). FASTQ files may be analyzed for rapid assessment of quality control and reads, for example, by a sequencing data QC software such as AfterQC, Kraken, RNA-SeQC, FastQC, or another similar software program. For paired end reads, reads may be merged.

In some embodiments, when both a liquid biopsy sample and a normal tissue sample from the patient are sequenced, two FASTQ output files are generated, one for the liquid biopsy sample and one for the normal tissue sample. A ‘matched’ (e.g., panel-specific) workflow is run to jointly analyze the liquid biopsy-normal matched FASTQ files. When a matched normal sample is not available from the patient, FASTQ files from the liquid biopsy sample are analyzed in the ‘tumor-only’ mode. See, for example, FIG. 4B. If two or more patient samples are processed simultaneously on the same sequencer flow cell, e.g., a liquid biopsy sample and a normal tissue sample, a difference in the sequence of the adapters used for each patient sample barcodes nucleic acids extracted from both samples, to associate each read with the correct patient sample and facilitate assignment to the correct FASTQ file.

For efficiency, in some embodiments, the results of paired-end sequencing of each isolate are contained in a split pair of FASTQ files. Forward (Read 1) and reverse (Read 2) sequences of each tumor and normal isolate are stored separately but in the same order and under the same identifier. See, for example, FIG. 4C. In various embodiments, the bioinformatics pipeline may filter FASTQ data from each isolate. Such filtering may include correcting or masking sequencer errors and removing (trimming) low quality sequences or bases, adapter sequences, contaminations, chimeric reads, overrepresented sequences, biases caused by library preparation, amplification, or capture, and other errors. See, for example, FIG. 4D.

Similarly, in some embodiments, sequencing (312) is performed on a pool of nucleic acid sequencing libraries prepared from different biological samples, e.g., from the same or different patients. Accordingly, in some embodiments, the system demultiplexes (320) the data (e.g., using demultiplexing algorithm 144) to separate sequence reads into separate files for each sequencing library included in the sequencing pool, e.g., based on UMI or patient identifier sequences added to the nucleic acid fragments during sequencing library preparation, as described above. In some embodiments, the demultiplexing algorithm is part of the same software package as one or more pre-processing algorithms 142. For instance, the bcl2fastq or bcl2fastq2 conversion software (Illumina®) includes instructions for both converting the native file format output from the sequencer and demultiplexing sequence reads 122 output from the reaction.
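For illustration, demultiplexing by identifier sequence can be sketched as a lookup from barcode to sequencing library. The sketch below assumes exact barcode matching and a hypothetical "undetermined" bin for unrecognized barcodes; it is not a description of the bcl2fastq software, and the barcodes and library names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def demultiplex(
    reads: Iterable[Tuple[str, str]],
    barcode_to_library: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group (barcode, sequence) reads by the library their barcode names."""
    by_library: Dict[str, List[str]] = defaultdict(list)
    for barcode, sequence in reads:
        # Exact-match lookup; unknown barcodes go to a catch-all bin.
        library = barcode_to_library.get(barcode, "undetermined")
        by_library[library].append(sequence)
    return dict(by_library)


# Hypothetical barcodes for a liquid biopsy library and a normal library.
barcodes = {"ACGTAC": "liquid_biopsy", "TTGCAA": "normal"}
pooled_reads = [
    ("ACGTAC", "GATTACA"),
    ("TTGCAA", "CCGGTTA"),
    ("ACGTAC", "GATTTCA"),
    ("NNNNNN", "AAAAAAA"),
]
per_library = demultiplex(pooled_reads, barcodes)
```

Each resulting group corresponds to the separate file that would be written for one sequencing library in the pool.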

The sequence reads are then aligned (322), e.g., using an alignment algorithm 143, to a reference sequence construct 158, e.g., a reference genome, reference exome, or other reference construct prepared for a particular targeted-panel sequencing reaction. For example, in some embodiments, individual sequence reads 123, in electronic form (e.g., in FASTQ files), are aligned against a reference sequence construct for the species of the subject (e.g., a reference human genome) by identifying a sequence in a region of the reference sequence construct that best matches the sequence of nucleotides in the sequence read. In some embodiments, the sequence reads are aligned to a reference exome or reference genome using known methods in the art to determine alignment position information. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information may also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome may be associated with a gene or a segment of a gene. Any of a variety of alignment tools can be used for this task.

For instance, local sequence alignment algorithms compare subsequences of different lengths in the query sequence (e.g., sequence read) to subsequences in the subject sequence (e.g., reference construct) to create the best alignment for each portion of the query sequence. In contrast, global sequence alignment algorithms align the entirety of the sequences, e.g., end to end. Examples of local sequence alignment algorithms include the Smith-Waterman algorithm (see, for example, Smith and Waterman, J Mol. Biol., 147(1):195-97 (1981), which is incorporated herein by reference), Lalign (see, for example, Huang and Miller, Adv. Appl. Math, 12:337-57 (1991), which is incorporated by reference herein), and PatternHunter (see, for example, Ma B. et al., Bioinformatics, 18(3):440-45 (2002), which is incorporated by reference herein).
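A minimal sketch of Smith-Waterman local alignment follows, for illustration only. The scoring values (match +2, mismatch -1, linear gap -2) and the function name are assumptions chosen for the example; production aligners typically use affine gap penalties and heavily optimized implementations.

```python
from typing import List, Tuple


def smith_waterman(
    query: str,
    reference: str,
    match: int = 2,      # assumed score for a matching base
    mismatch: int = -1,  # assumed penalty for a mismatching base
    gap: int = -2,       # assumed linear gap penalty
) -> Tuple[int, int, str, str]:
    """Return (best score, 0-based reference start, aligned query, aligned ref)."""
    rows, cols = len(query) + 1, len(reference) + 1
    score: List[List[int]] = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i: int, j: int) -> int:
        return match if query[i - 1] == reference[j - 1] else mismatch

    # Fill the matrix; cells are floored at zero so alignments can restart.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    aligned_query: List[str] = []
    aligned_ref: List[str] = []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_query.append(query[i - 1])
            aligned_ref.append(reference[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_query.append(query[i - 1])
            aligned_ref.append("-")
            i -= 1
        else:
            aligned_query.append("-")
            aligned_ref.append(reference[j - 1])
            j -= 1
    return best, j, "".join(reversed(aligned_query)), "".join(reversed(aligned_ref))


result = smith_waterman("ACGT", "TTACGTTT")
```

Here the four-base query matches the reference exactly starting at 0-based position 2, giving a score of 8 under the assumed scoring scheme.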

In some embodiments, the read mapping process starts by building an index of either the reference genome or the reads, which is then used to retrieve the set of positions in the reference sequence where the reads are more likely to align. Once this subset of possible mapping locations has been identified, alignment is performed in these candidate regions with slower and more sensitive algorithms. See, for example, Hatem et al., 2013, “Benchmarking short sequence mapping tools,” BMC Bioinformatics 14: p. 184; and Flicek and Birney, 2009, “Sense from sequence reads: methods for alignment and assembly,” Nat Methods 6 (Suppl. 11), S6-S12, each of which is hereby incorporated by reference. In some embodiments, the mapping tools methodology makes use of a hash table or a Burrows-Wheeler transform (BWT). See, for example, Li and Homer, 2010, “A survey of sequence alignment algorithms for next-generation sequencing,” Brief Bioinformatics 11, pp. 473-483, which is hereby incorporated by reference.
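The index-then-align strategy described above can be illustrated with a hash-table index of reference k-mers. In the sketch below, the k-mer length, the function names, and the toy sequences are assumptions; each shared k-mer votes for the reference offset at which the read would start, and the top-voted offsets are the candidate regions that a slower, more sensitive aligner would then examine.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(
    read: str, index: Dict[str, List[int]], k: int
) -> List[Tuple[int, int]]:
    """Return (implied read start, vote count) pairs, most votes first."""
    votes: Counter = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            # A k-mer hit at reference position pos implies the read
            # would start at pos minus the k-mer's offset in the read.
            votes[pos - offset] += 1
    return votes.most_common()


# Toy reference and read; real indexes cover an entire reference genome.
reference = "GATTACAGATTTCCGA"
kmer_index = build_kmer_index(reference, k=4)
candidates = candidate_positions("ACAGATT", kmer_index, k=4)
```

In this toy case all four k-mers of the read agree on reference position 4, while a repeated k-mer contributes a single spurious vote elsewhere, which shows why vote counts help rank candidate locations.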

Other software programs designed to align reads include, for example, Novoalign (Novocraft, Inc.), Bowtie, Burrows Wheeler Aligner (BWA), and/or programs that use a Smith-Waterman algorithm. Candidate reference genomes include, for example, hg19, GRCh38, hg38, GRCh37, and/or other reference genomes developed by the Genome Reference Consortium. In some embodiments, the alignment generates a SAM file, which stores the locations of the start and end of each read according to coordinates in the reference genome and the coverage (number of reads) for each nucleotide in the reference genome.

For example, in some embodiments, each read of a FASTQ file is aligned to a location in the human genome having a sequence that best matches the sequence of nucleotides in the read. There are many software programs designed to align reads, for example, Novoalign (Novocraft, Inc.), Bowtie, Burrows Wheeler Aligner (BWA), programs that use a Smith-Waterman algorithm, etc. Alignment may be directed using a reference genome (for example, hg19, GRCh38, hg38, GRCh37, other reference genomes developed by the Genome Reference Consortium, etc.) by comparing the nucleotide sequences in each read with portions of the nucleotide sequence in the reference genome to determine the portion of the reference genome sequence that is most likely to correspond to the sequence in the read. In some embodiments, one or more SAM files are generated for the alignment, which store the locations of the start and end of each read according to coordinates in the reference genome and the coverage (number of reads) for each nucleotide in the reference genome. The SAM files may be converted to BAM files. In some embodiments, the BAM files are sorted, and duplicate reads are marked for deletion, resulting in de-duplicated BAM files.
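As a simplified illustration of how per-nucleotide coverage follows from read start and end coordinates, and of the effect of de-duplication, consider the sketch below. It assumes 0-based, half-open intervals and treats reads with identical start and end coordinates as duplicates; both are simplifying assumptions for the example and do not reflect the full criteria used by duplicate-marking tools.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end exclusive


def per_base_coverage(
    intervals: Iterable[Interval], reference_length: int
) -> List[int]:
    """Number of aligned reads covering each reference position."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (reference_length + 1)
    for start, end in intervals:
        diff[start] += 1
        diff[end] -= 1
    coverage: List[int] = []
    running = 0
    for delta in diff[:reference_length]:
        running += delta
        coverage.append(running)
    return coverage


def deduplicate(intervals: Iterable[Interval]) -> List[Interval]:
    """Sort by position and keep one read per (start, end) pair."""
    return sorted(set(intervals))


# Three aligned reads on an 8-base toy reference; two are duplicates.
aligned = [(2, 6), (0, 4), (2, 6)]
raw_coverage = per_base_coverage(aligned, 8)
dedup_coverage = per_base_coverage(deduplicate(aligned), 8)
```

In this example the duplicated read inflates coverage at positions 2 through 5, and removing it lowers the coverage there by one.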

In some embodiments, adapter-trimmed FASTQ files are aligned to the 19th edition of the human reference genome build (HG19) using Burrows-Wheeler Aligner (BWA, Li and Durbin, Bioinformatics, 25(14):1754-60 (2009)). Following alignment, reads are grouped by alignment position and UMI family and collapsed into consensus sequences, for example, using fgbio tools (fulcrumgenomics.github.io/fgbio/). Bases with insufficient quality or significant disagreement among family members (for example, when it is uncertain whether the base is an adenine, cytosine, guanine, etc.) may be replaced by N's to represent a wildcard nucleotide type. PHRED scores are then scaled based on initial base calling estimates combined across all family members. Following single-strand consensus generation, duplex consensus sequences are generated by comparing the forward and reverse oriented PCR products with mirrored UMI sequences. In various embodiments, a consensus can be generated across read pairs. Otherwise, single-strand consensus calls will be used. Following consensus calling, filtering is performed to remove low-quality consensus fragments. The consensus fragments are then re-aligned to the human reference genome using BWA. A BAM output file is generated after the re-alignment, then sorted by alignment position, and indexed.

In some embodiments, where both a liquid biopsy sample and a normal tissue sample are analyzed, this process produces a liquid biopsy BAM file (e.g., Liquid BAM 124-1-i-cf) and a normal BAM file (e.g., Germline BAM 124-1-i-g), as illustrated in FIG. 4A. In various embodiments, BAM files may be analyzed to detect genetic variants and other genetic features, including single nucleotide variants (SNVs), copy number variants (CNVs), gene rearrangements, etc.

In some embodiments, the sequencing data is normalized, e.g., to account for pull-down, amplification, and/or sequencing bias (e.g., mappability, GC bias, etc.). See, for example, Schwartz et al., PLoS ONE 6(1):e16685 (2011) and Benjamini and Speed, Nucleic Acids Research 40(10):e72 (2012), the contents of which are hereby incorporated by reference, in their entireties, for all purposes.

In some embodiments, SAM files generated after alignment are converted to BAM files 124. Thus, after preprocessing sequencing data generated for a pooled sequencing reaction, BAM files are generated for each of the sequencing libraries present in the master sequencing pools. For example, as illustrated in FIG. 4A, separate BAM files are generated for each of three samples acquired from subject 1 at time i (e.g., Tumor BAM 124-1-i-t corresponding to alignments of sequence reads of nucleic acids isolated from a solid tumor sample from subject 1, Liquid BAM 124-1-i-cf corresponding to alignments of sequence reads of nucleic acids isolated from a liquid biopsy sample from subject 1, and Germline BAM 124-1-i-g corresponding to alignments of sequence reads of nucleic acids isolated from a normal tissue sample from subject 1), and one or more samples acquired from one or more additional subjects at time j (e.g., Tumor BAM 124-2-j-t corresponding to alignments of sequence reads of nucleic acids isolated from a solid tumor sample from subject 2). In some embodiments, BAM files are sorted, and duplicate reads are marked for deletion, resulting in de-duplicated BAM files. For example, tools like Sambamba mark and filter duplicate alignments in the sorted BAM files.

Many of the embodiments described below, in conjunction with FIGS. 4A-4F, relate to analyses performed using sequencing data from cfDNA of a cancer patient, e.g., obtained from a liquid biopsy sample of the patient. Generally, these embodiments are independent and, thus, not reliant upon any particular sequencing data generation methods, e.g., sample preparation, sequencing, and/or data pre-processing methodologies. However, in some embodiments, the methods described below include one or more features 204 of generating sequencing data, as illustrated in FIGS. 2A and 3.

Alignment files prepared as described above (e.g., BAM files 124) are then passed to a feature extraction module 145, where the sequences are analyzed (324) to identify genomic alterations (e.g., SNVs/MNVs, indels, genomic rearrangements, copy number variations, etc.) and/or determine various characteristics of the patient's cancer (e.g., MSI status, TMB, tumor ploidy, HRD status, tumor fraction, tumor purity, methylation patterns, etc.). Many software packages for identifying genomic alterations are known in the art, for example, freebayes, PolyBayes, SAMtools, GATK, pindel, Breakdancer, Cortex, Crest, Delly, Gridss, Hydra, Lumpy, Manta, and Socrates. For a review of many of these variant calling packages see, for example, Cameron, D. L. et al., Nat. Commun., 10(3240):1-11 (2019), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Generally, these software packages identify variants in sorted SAM or BAM files 124, relative to one or more reference sequence constructs 158. The software packages then output a file, e.g., a raw VCF (variant call format) file, listing the variants (e.g., genomic features 131) called and identifying their location relative to the reference sequence construct (e.g., where the sequence of the sample nucleic acids differs from the corresponding sequence in the reference construct). In some embodiments, system 100 digests the contents of the native output file to populate feature data 125 in test patient data store 120. In other embodiments, the native output file serves as the record of these genomic features 131 in test patient data store 120.

Generally, the systems described herein can employ any combination of available variant calling software packages and internally developed variant identification algorithms. In some embodiments, the output of a particular algorithm of a variant calling software is further evaluated, e.g., to improve variant identification. Accordingly, in some embodiments, system 100 employs an available variant calling software package to perform some or all of the functionality of one or more of the algorithms shown in feature extraction module 145.

In some embodiments, as illustrated in FIG. 1A, separate algorithms (or the same algorithm implemented using different parameters) are applied to identify variants unique to the cancer genome of the patient and variants existing in the germline of the subject. In other embodiments, variants are identified indiscriminately and later classified as either germline or somatic, e.g., based on sequencing data, population data, or a combination thereof. In some embodiments, variants are classified as germline variants, and/or non-actionable variants, when they are represented in the population above a threshold level, e.g., as determined using a population database such as ExAC or gnomAD. For instance, in some embodiments, variants that are represented in at least 1% of the alleles in a population are annotated as germline and/or non-actionable. In other embodiments, variants that are represented in at least 2%, at least 3%, at least 4%, at least 5%, at least 7.5%, at least 10%, or more of the alleles in a population are annotated as germline and/or non-actionable. In some embodiments, sequencing data from a matched sample from the patient, e.g., a normal tissue sample, is used to annotate variants identified in a cancerous sample from the subject. That is, variants that are present in both the cancerous sample and the normal sample represent those variants that were in the germline prior to the patient developing cancer and can be annotated as germline variants.
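
By way of non-limiting illustration only, the population-frequency annotation described above can be sketched as follows. The function name, the return labels, and the handling of variants absent from the database are assumptions made for this sketch; the default cutoff of 0.01 mirrors the 1% example above.

```python
def annotate_by_population_frequency(population_af, threshold=0.01):
    """Annotate a variant from its population allele frequency.

    population_af: fraction of alleles carrying the variant in a population
        database, or None when the variant is absent (assumed convention).
    threshold: hypothetical cutoff; 0.01 mirrors the 1% example in the text.
    """
    if population_af is not None and population_af >= threshold:
        return "germline/non-actionable"
    # Below the cutoff (or absent): left for downstream somatic evaluation.
    return "unannotated"
```

A variant present in exactly 1% of alleles is annotated as germline and/or non-actionable under this sketch, consistent with the "at least 1%" wording.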

In various aspects, the detected genetic variants and genetic features are analyzed as a form of quality control. For example, a pattern of detected genetic variants or features may indicate an issue related to the sample, sequencing procedure, and/or bioinformatics pipeline (e.g., contamination of the sample, mislabeling of the sample, a change in reagents, a change in the sequencing procedure and/or bioinformatics pipeline, etc.).

FIG. 4E illustrates an example workflow for genomic feature identification (324). This particular workflow is only an example of one possible collection and arrangement of algorithms for feature extraction from sequencing data 124. Generally, any combination of the modules and algorithms of feature extraction module 145, e.g., illustrated in FIG. 1A, can be used for a bioinformatics pipeline, and particularly for a bioinformatics pipeline for analyzing liquid biopsy samples. For instance, in some embodiments, an architecture useful for the methods and systems described herein includes at least one of the modules or variant calling algorithms shown in feature extraction module 145. In some embodiments, an architecture includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or more of the modules or variant calling algorithms shown in feature extraction module 145. Further, in some embodiments, feature extraction modules and/or algorithms not illustrated in FIG. 1A find use in the methods and systems described herein.

Variant Identification

In some embodiments, variant analysis of aligned sequence reads, e.g., in SAM or BAM format, includes identification of single nucleotide variants (SNVs), multiple nucleotide variants (MNVs), indels (e.g., nucleotide additions and deletions), and/or genomic rearrangements (e.g., inversions, translocations, and gene fusions) using variant identification module 146, e.g., which includes a SNV/MNV calling algorithm (e.g., SNV/MNV calling algorithm 147), an indel calling algorithm (e.g., indel calling algorithm 148), and/or one or more genomic rearrangement calling algorithms (e.g., genomic rearrangement calling algorithm 149). An overview of an example method for variant identification is shown in FIG. 4E. Essentially, the module first identifies a difference between the sequence of an aligned sequence read 124 and the reference sequence to which the sequence read is aligned (e.g., an SNV/MNV, an indel, or a genomic rearrangement) and makes a record of the variant, e.g., in a variant call format (VCF) file. For instance, software packages such as freebayes and pindel are used to call variants using sorted BAM files and reference BED files as the input. For a review of variant calling packages see, for example, Cameron, D. L. et al., Nat. Commun., 10(3240):1-11 (2019). A raw VCF (variant call format) file is output, showing the locations where the nucleotide base in the sample is not the same as the nucleotide base in that position in the reference sequence construct.

In some embodiments, as illustrated in FIG. 4E, raw VCF data is then normalized, e.g., by parsimony and left alignment. For example, software packages such as vcfbreakmulti and vt are used to normalize multi-nucleotide polymorphic variants in the raw VCF file and a variant normalized VCF file is output. See, for example, E. Garrison, “Vcflib: A C++ library for parsing and manipulating VCF files,” GitHub, available on the internet at github.com/ekg/vcflib (2012), the content of which is hereby incorporated by reference, in its entirety, for all purposes. In some embodiments, a normalization algorithm is included within the architecture of a broader variant identification software package.

An algorithm is then used to annotate the variants in the (e.g., normalized) VCF file, e.g., to determine the source of the variation, such as whether the variant is from the germline of the subject (e.g., a germline variant), a cancerous tissue (e.g., a somatic variant), a sequencing error, or of an undeterminable source. In some embodiments, an annotation algorithm is included within the architecture of a broader variant identification software package. However, in some embodiments, an external annotation algorithm is applied to (e.g., normalized) VCF data obtained from a conventional variant identification software package. The choice to use a particular annotation algorithm is well within the purview of the skilled artisan, and in some embodiments is based upon the data being annotated.

For example, in some embodiments, where both a liquid biopsy sample and a normal tissue sample of the patient are analyzed, variants identified in the normal tissue sample inform annotation of the variants in the liquid biopsy sample. In some embodiments, where a particular variant is identified in the normal tissue sample, that variant is annotated as a germline variant in the liquid biopsy sample. Similarly, in some embodiments, where a particular variant identified in the liquid biopsy sample is not identified in the normal tissue sample, the variant is annotated as a somatic variant when the variant otherwise satisfies any additional criteria placed on somatic variant calling, e.g., a threshold variant allele fraction (VAF) in the sample.

By contrast, in some embodiments, where only a liquid biopsy sample is being analyzed, the annotation algorithm relies on other characteristics of the variant in order to annotate the origin of the variant. For instance, in some embodiments, the annotation algorithm evaluates the VAF of the variant in the sample, e.g., alone or in combination with additional characteristics of the sample, e.g., tumor fraction. Accordingly, in some embodiments, where the VAF is within a first range encompassing a value that corresponds to a 1:1 distribution of variant and reference alleles in the sample, the algorithm annotates the variant as a germline variant, because it is presumably represented in cfDNA originating from both normal and cancer tissues. Similarly, in some embodiments, where the VAF is below a baseline variant threshold, the algorithm annotates the variant as undeterminable, because there is not sufficient evidence to distinguish between the possibility that the variant arose as a result of an amplification or sequencing error and the possibility that the variant originated from a cancerous tissue. Similarly, in some embodiments, where the VAF falls between the first range and the baseline variant threshold, the algorithm annotates the variant as a somatic variant.
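
By way of non-limiting illustration only, the three VAF-based annotations described above can be sketched as follows. The bounds of the first range (0.4 to 0.6) and the handling of VAFs above that range are assumptions made for this sketch and are not specified herein; the default baseline variant threshold of 0.25% VAF is one of the example values given in this disclosure.

```python
def annotate_by_vaf(vaf, baseline_threshold=0.0025, germline_range=(0.4, 0.6)):
    """Annotate variant origin from its variant allele fraction (0.0 to 1.0).

    germline_range: hypothetical "first range" around a 1:1 ratio of variant
        and reference alleles (VAF near 0.5).
    baseline_threshold: example value of 0.25% VAF, expressed as a fraction.
    """
    low, high = germline_range
    if low <= vaf <= high:
        return "germline"
    if vaf < baseline_threshold:
        # Cannot distinguish amplification/sequencing error from tumor signal.
        return "undeterminable"
    if vaf < low:
        # Between the baseline variant threshold and the first range.
        return "somatic"
    # VAF above the first range is not addressed in the text (assumption).
    return "unclassified"
```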

In some embodiments, the baseline variant threshold is a value from 0.01% VAF to 0.5% VAF. In some embodiments, the baseline variant threshold is a value from 0.05% VAF to 0.35% VAF. In some embodiments, the baseline variant threshold is a value from 0.1% VAF to 0.25% VAF. In some embodiments, the baseline variant threshold is about 0.01% VAF, 0.015% VAF, 0.02% VAF, 0.025% VAF, 0.03% VAF, 0.035% VAF, 0.04% VAF, 0.045% VAF, 0.05% VAF, 0.06% VAF, 0.07% VAF, 0.075% VAF, 0.08% VAF, 0.09% VAF, 0.1% VAF, 0.15% VAF, 0.2% VAF, 0.25% VAF, 0.3% VAF, 0.35% VAF, 0.4% VAF, 0.45% VAF, 0.5% VAF, or greater. In some embodiments, the baseline variant threshold is different for variants located in a first region, e.g., a region identified as a mutational hotspot and/or having high genomic complexity, than for variants located in a second region, e.g., a region that is not identified as a mutational hotspot and/or having average genomic complexity. For example, in some embodiments, the baseline variant threshold is a value from 0.01% to 0.25% for variants located in the first region and is a value from 0.1% to 0.5% for variants located in the second region.

In some embodiments, the first region is a region of interest in the genome that may have been manually selected based on criteria (for example, selection may be based on a known likelihood that a region is associated with variants) and the second region is a region that did not meet the selection criteria. In some embodiments, the baseline variant threshold is a value from 0.01% to 0.5% for variants located in the first region and is a value from 1% to 5% for variants located in the second region. In some embodiments, the first region is a region of interest in the genome that may have been manually selected based on criteria (for example, selection may be based on a known likelihood that a region is associated with variants) and the second region is a region selected based on a second set of criteria.

In some embodiments, a baseline variant threshold is influenced by the sequencing depth of the reaction, e.g., a locus-specific sequencing depth and/or an average sequencing depth (e.g., across a targeted panel and/or complete reference sequence construct). In some embodiments, the baseline variant threshold is dependent upon the type of variant being detected. For example, in some embodiments, different baseline variant thresholds are set for SNPs/MNVs than for indels and/or genomic rearrangements. For instance, while an apparent SNP may be introduced by amplification and/or sequencing errors, it is much less likely that a genomic rearrangement is introduced this way and, thus, a lower baseline variant threshold may be appropriate for genomic rearrangements than for SNPs/MNVs.

In some embodiments, one or more additional criteria are required to be satisfied before a variant can be annotated as a somatic variant. For instance, in some embodiments, a threshold number of unique sequence reads encompassing the variant must be present to annotate the variant as somatic. In some embodiments, the threshold number of unique sequence reads is 2, 3, 4, 5, 7, 10, 12, 15, or greater. In some embodiments, the threshold number of unique sequence reads is only applied when certain conditions are met, e.g., when the variant allele is located in a region of a certain genomic complexity. In some embodiments, the certain genomic complexity is a low genomic complexity. In some embodiments, the certain genomic complexity is an average genomic complexity. In some embodiments, the certain genomic complexity is a high genomic complexity.

In some embodiments, a threshold sequencing coverage, e.g., a locus-specific and/or an average sequencing depth (e.g., across a targeted panel and/or complete reference sequence construct) must be satisfied to annotate the variant as somatic. In some embodiments, the threshold sequencing coverage is 50×, 100×, 150×, 200×, 250×, 300×, 350×, 400× or greater. In some embodiments, the variant is located in a microsatellite instable (MSI) region. In some embodiments, the variant is not located in a microsatellite instable (MSI) region. In some embodiments, the variant has sufficient signal-to-noise ratio.

In some embodiments, bases contributing to the variant must satisfy a threshold mapping quality to annotate the variant as somatic. In some embodiments, alignments contributing to the variant must satisfy a threshold alignment quality to annotate the variant as somatic. In some embodiments, a threshold value is determined for a variant detected in a somatic (cancer) sample by analyzing the threshold metric (for example, the baseline variant threshold is determined by analyzing VAF, or the threshold sequencing coverage is determined by analyzing coverage) associated with that variant in a group of germline (normal) samples that were each processed by the same sample processing and sequencing protocol as the somatic sample (process-matched). This may be used to ensure the variants are not caused by observed artifact-generating processes.

In some embodiments, the threshold value is set above the median base fraction of the threshold metric value associated with the variant in more than a specified percentage of process-matched germline samples, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more standard deviations above the median base fraction of the threshold metric value associated with the variant in 25%, 30%, 40%, 50%, 60%, 70%, 75%, or more of the process-matched germline samples. For example, in one embodiment, the threshold value is set to a value 5 standard deviations above the median base fraction of the threshold metric value associated with the variant in more than 50% of the process-matched germline samples.
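
By way of non-limiting illustration only, one possible reading of the threshold calculation described above is sketched below. The sketch takes the per-sample values of the threshold metric (e.g., the base fraction supporting the variant) across a panel of process-matched normal samples and returns the median plus a chosen number of standard deviations. The function name and the use of the sample standard deviation are assumptions made for this sketch; the selection of which samples contribute (the "specified percentage") is not modeled.

```python
import statistics


def threshold_from_normals(normal_values, n_std=5.0):
    """Return median + n_std * sample standard deviation of the metric.

    normal_values: threshold metric observed for the variant in each
        process-matched normal sample (hypothetical input layout).
    n_std: number of standard deviations above the median; 5 mirrors the
        example embodiment in the text.
    """
    if len(normal_values) < 2:
        raise ValueError("at least two process-matched normal samples are needed")
    return statistics.median(normal_values) + n_std * statistics.stdev(normal_values)
```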

In some embodiments, variants around homopolymer and multimer regions known to generate artifacts may be specifically filtered to avoid such artifacts. For example, in some embodiments, strand-specific filtering is performed in the direction of the read in order to minimize stranded artifacts. Similarly, in some embodiments, variants that do not exceed the stranded minimum deviation for their specific locus within a known artifact-generating region may be filtered to avoid artifacts.

Variants may be filtered using dynamic methods, such as through the application of Bayes' Theorem through a likelihood ratio test. In some such embodiments, the threshold is dynamically calibrated to account for variants with low support (e.g., due to low tumor fraction, low circulating tumor fraction, and/or low sequencing depths). The dynamic threshold may be based on, for example, factors such as sample-specific error rate, the error rate from a healthy reference pool (e.g., a pool of process-matched healthy control samples for validation of variants detected in tumor samples), and information from internal human solid tumors (e.g., for validation of variants detected in liquid biopsy samples). Accordingly, in some embodiments, the dynamic filtering method employs a tri-nucleotide context-based Bayesian model. That is, in some embodiments, the threshold for filtering any particular putative variant is dynamically calibrated using a context-based Bayesian model that considers one or more of a sample-specific sequencing error rate, a process-matched control sequencing error rate, and/or a variant-specific frequency (e.g., determined from similar cancers). In this fashion, a minimum number of alternative alleles required to positively identify a true variant is determined for individual alleles and/or loci.

In some embodiments, the dynamic threshold is selected from a Bayesian probability model, where the selection is based on one or more error rates and/or information from one or more baseline variant distributions. For example, in some embodiments, the dynamic threshold is selected based on a variant detection specificity that is calculated using a distribution of variant detection sensitivities, where the distribution of variant detection sensitivities is a function of circulating variant allele fraction from a plurality of baseline and/or reference alleles (e.g., from a cohort of subjects). Filtration of variants using a dynamic threshold (e.g., to validate the presence of a somatic variant) is performed by comparing the number of unique sequence reads encompassing the variant (e.g., a variant allele fragment count for the variant) against the dynamic threshold.

In some embodiments, the methods described herein (e.g., methods 400, 450, and 500 as illustrated in FIGS. 4 and 5) include one or more data collection steps, in addition to data analysis and downstream steps. For example, as described herein, e.g., with reference to FIGS. 2 and 3, in some embodiments, the methods include collection of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Likewise, as described herein, e.g., with reference to FIGS. 2 and 3, in some embodiments, the methods include extraction of DNA from the liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). Similarly, as described herein, e.g., with reference to FIGS. 2 and 3, in some embodiments, the methods include nucleic acid sequencing of DNA from the liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject).

However, in other embodiments, the methods described herein begin with obtaining nucleic acid sequencing results, e.g., raw or collapsed sequence reads of DNA from a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject), from which the statistics needed for somatic variant identification (e.g., variant allele count 133-ac and/or variant allele fraction 133-af) can be determined. For example, in some embodiments, sequencing data 122 for a patient 121 is accessed and/or downloaded over network 105 by system 100.

Similarly, in some embodiments, the methods described herein begin with obtaining the genomic features needed for somatic variant identification (e.g., variant allele count 133-ac and/or variant allele fraction 133-af) for a sequencing of a liquid biopsy sample and, optionally, one or more matching biological samples from the subject (e.g., a matched cancerous and/or matched non-cancerous sample from the subject). For example, in some embodiments, variant allele counts 133-cf-ac and/or variant allele fractions 133-cf-af for sequencing data 122 of patient 121 are accessed and/or downloaded over network 105 by system 100.

One goal of the liquid biopsy assays described herein is to detect variant alterations at low circulating fractions, which requires that low levels of support be sufficient to call a variant. Therefore, fixed thresholds for filtering variants, which do not take into account variant context and local sequence-specific error, cannot be used.

In some embodiments, a dynamic variant filtering method is applied which uses an application of Bayes' Theorem through the likelihood ratio test. The dynamic threshold is based on the sample-specific error rate, the error rate from a healthy reference pool, and information from internal human solid tumors. The basic application of the likelihood ratio test is as follows:

post-test-odds=pre-test-odds*sensitivity/(1−specificity)

Given a fixed value for post-test-odds, the specificity can be solved for. The specificity represents the minimum acceptable quantile of an error distribution (e.g., a BetaBinomial, Beta, or Poisson error distribution). The above equation can be rearranged to the one below:

specificity=1−pre-test-odds*sensitivity/post-test-odds
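
The rearrangement between the two equations above can be written out step by step. The numeric values in the closing example (pre-test odds of 0.05, sensitivity of 0.9, and post-test odds of 99) are hypothetical and chosen only for illustration.

```latex
\begin{align*}
\text{post-test-odds} &= \text{pre-test-odds} \times \frac{\text{sensitivity}}{1 - \text{specificity}} \\
1 - \text{specificity} &= \frac{\text{pre-test-odds} \times \text{sensitivity}}{\text{post-test-odds}} \\
\text{specificity} &= 1 - \frac{\text{pre-test-odds} \times \text{sensitivity}}{\text{post-test-odds}} \\[6pt]
\text{e.g.,}\quad \text{specificity} &= 1 - \frac{0.05 \times 0.9}{99} \approx 0.99955
\end{align*}
```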

Specificity can then be plugged into the quantile error (e.g., BetaBinomial, Beta, or Poisson) function to derive the minimum number of alternative alleles that can be observed at a given depth to validate a candidate somatic variant.
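
By way of non-limiting illustration only, the calculation described above can be sketched using a Poisson error distribution, one of the example distributions named herein. All function and parameter names are hypothetical. The sketch assumes that the expected number of erroneous alternate alleles equals the depth multiplied by a per-base error rate, and that a candidate variant is validated when its alternate allele count exceeds the returned background count; neither convention is mandated by this disclosure.

```python
import math


def required_specificity(pre_test_odds, sensitivity, post_test_odds):
    """specificity = 1 - pre-test-odds * sensitivity / post-test-odds."""
    return 1.0 - pre_test_odds * sensitivity / post_test_odds


def poisson_quantile(q, mean, max_count=100_000):
    """Smallest k with P(X <= k) >= q for X ~ Poisson(mean).

    Intended for small means (depth * error rate); exp(-mean) underflows
    for very large means, which this sketch does not handle.
    """
    k = 0
    pmf = math.exp(-mean)
    cdf = pmf
    while cdf < q and k < max_count:
        k += 1
        pmf *= mean / k  # recurrence: P(k) = P(k - 1) * mean / k
        cdf += pmf
    return k


def background_alt_count(depth, error_rate, pre_test_odds, sensitivity, post_test_odds):
    """Error-distribution quantile at the required specificity."""
    specificity = required_specificity(pre_test_odds, sensitivity, post_test_odds)
    return poisson_quantile(specificity, depth * error_rate)
```

Under these assumptions, raising the pre-test odds lowers the required specificity and therefore lowers, or leaves unchanged, the number of alternate alleles needed at a given depth.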

In some embodiments, the post-test odds are post-test probability/(1−post-test probability). The post-test probability is the probability of having a positive variant given Bayes' Theorem. The post-test odds are pre-defined.

In some embodiments, the pre-test odds are pre-test probability/(1−pre-test probability). The pre-test probability is the probability of having a positive variant given the patient's cancer-type and the prevalence of variant alterations within a genomic region encompassing a candidate somatic sequence variant in a reference population having the same cancer type.

In some embodiments, a pre-test-odds multiplier is applied to the pre-test odds for a resistance mutation that would develop and/or become more prominent within a heterogeneous population of cancer cells in response to therapeutic treatment. The multiplier is applied to specific genomic regions (e.g., exon windows) containing the resistance mutation position. In some embodiments, the multiplier is only applied in specified cancer contexts. For example, in some embodiments, a multiplier is applied to a pre-test odds for a genomic region containing a mutation that is resistant to at least one cancer therapy used to treat the type of cancer the subject has. For example, if a given mutation is known to have resistance to a therapy used to treat breast cancer, but not to any of the therapies used to treat brain cancer, a multiplier will be applied to the pre-test odds for the genomic region encompassing the mutation if the subject has breast cancer, but not if the subject has brain cancer.
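
By way of non-limiting illustration only, the context-restricted application of the multiplier can be sketched as a lookup keyed on both the cancer type and the genomic region. The table contents, the region label, and the multiplier value of 100 are hypothetical placeholders and are not values taken from this disclosure.

```python
# Hypothetical table: (cancer type, genomic region) -> pre-test-odds multiplier.
RESISTANCE_MULTIPLIERS = {
    ("breast", "region_with_resistance_mutation"): 100.0,
}


def adjusted_pre_test_odds(pre_test_odds, cancer_type, region,
                           multipliers=RESISTANCE_MULTIPLIERS):
    """Apply a multiplier only when both cancer type and region match.

    Any other combination falls back to a multiplier of 1.0, leaving the
    pre-test odds unchanged.
    """
    return pre_test_odds * multipliers.get((cancer_type, region), 1.0)
```

With this table, the same genomic region receives the multiplier for a subject having breast cancer but not for a subject having brain cancer, matching the example above.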

In some embodiments, sensitivity is the fraction of variants detected by the liquid biopsy assay at a given variant allele fraction (e.g., 0.1%, 0.25%, 0.5%, etc.).

Calculating the pre-test probability. In some embodiments, the pre-test probability is calculated using historical data for a set of reference subjects having the same type of cancer, e.g., from sequencing of solid tumor samples. In this fashion, it is possible to accurately assess the prevalence of specific variants within the population of advanced human tumors. In some embodiments, the set of reference subjects is at least 10 reference subjects. In some embodiments, the set of reference subjects is at least 50 reference subjects. In some embodiments, the set of reference subjects is at least 100 reference subjects. In some embodiments, the set of reference subjects is at least 500 reference subjects. In some embodiments, the set of reference subjects is at least 1000 reference subjects. In some embodiments, the set of reference subjects is at least 5000 reference subjects. In some embodiments, the set of reference subjects is at least 10000 reference subjects.

In some embodiments, variant prevalence is calculated by indexing genomic regions (e.g., exons) in the reference sample and counting the number of variants in each genomic region (e.g., exon) for each cancer-type. The variant prevalence equals the number of patients who have at least one variant in the genomic region (e.g., the exon) divided by the total number of patients. The pre-test-odds are calculated from the prevalence by pre-test-odds=prevalence/(1−prevalence).
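
By way of non-limiting illustration only, the prevalence and pre-test-odds calculations described above can be sketched as follows. The input layout (a mapping from each reference patient of one cancer type to the set of genomic regions in which that patient has at least one variant) and all names are assumptions made for this sketch.

```python
def variant_prevalence(variant_regions_by_patient, region):
    """Fraction of reference patients with at least one variant in region.

    variant_regions_by_patient: dict mapping patient id -> set of genomic
        regions (e.g., exons) containing at least one variant.
    """
    n_patients = len(variant_regions_by_patient)
    if n_patients == 0:
        raise ValueError("reference cohort is empty")
    n_with_variant = sum(
        1 for regions in variant_regions_by_patient.values() if region in regions
    )
    return n_with_variant / n_patients


def pre_test_odds(prevalence):
    """pre-test-odds = prevalence / (1 - prevalence)."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must be in [0, 1)")
    return prevalence / (1.0 - prevalence)
```

For example, if one of four reference patients has a variant in a given exon, the prevalence is 0.25 and the pre-test-odds are 0.25/0.75, or one third.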

In some embodiments, for a cancer where the number of patients in the reference is too low to calculate prevalence, a default pan-cancer cancer-type is used. Where no prevalence can be calculated, the mean variant prevalence across cancer-types is used.

In some embodiments, pre-test-odds are not calculated each time an input sample is run. Rather, in some embodiments, they are read from a pre-existing file, which will be evaluated and regenerated if deemed necessary.

Calculating the pre-test-odds multiplier. Resistance mutations have historically low prevalence and variant allele fraction and may incorrectly be filtered by the dynamic variant filtering method due to low pre-test-odds caused by their inherent low prevalence in the population. The resistance mutations develop in response to therapeutic treatment, and detecting resistance mutations early provides insights into the current treatment strategy. Low variant allele frequency, low prevalence resistance mutations in historic solid tumor samples have been identified. The high sensitivity of the liquid biopsy assay described herein permits the early detection of these resistance mutations in circulating DNA. Examples of such resistance mutations include PIK3CA p.E545K in breast cancer, EGFR p.T790M in non-small cell lung cancer, and AR p.H875Y for prostate cancer.

In some embodiments, to estimate the pre-test-odds multiplier required to pass resistance mutations down to low variant allele fractions (e.g., 0.01% VAF, 0.05% VAF, 0.1% VAF, 0.15% VAF, 0.2% VAF, 0.25% VAF, 0.3% VAF, 0.4% VAF, 0.5% VAF, and the like), the average depth for each variant position is taken from the reference pool (e.g., the reference pool used to determine the pre-test odds), at a high minimum average depth (e.g., 500×, 1000×, 1500×, 2000×, 2500×, 3000×, 4000×, 5000×, or more). For each resistance mutation, the number of alternate alleles required to achieve the low VAF (e.g., 0.01% VAF, 0.05% VAF, 0.1% VAF, 0.15% VAF, 0.2% VAF, 0.25% VAF, 0.3% VAF, 0.4% VAF, 0.5% VAF, and the like) is calculated. The total alternate alleles and depth for each resistance mutation are input to the Dynamic Variant Filtering method, and multipliers are applied until those resistance mutations pass the filtering strategy.

In some embodiments, the minimum multiplier required to pass resistance mutations is determined when the input sample alternate allele count is greater than the background alternate allele count (as outlined in Calculating Testing Sample Alt Allele Count and Calculating Background Alt Allele Count below). In some embodiments, the multiplier is selected based on the multiplier required to pass the variant at a low variant allele fraction (e.g., 0.1% VAF or 0.25% VAF). In some embodiments, a maximum value for the multiplier is applied, in order to prevent excessive artifacts from passing the filter. Large multipliers may permit false positive variants to pass the Dynamic Variant Filtering method; however, large multipliers are necessary to pass resistance mutations that have historically low prevalence. In some embodiments, the maximum multiplier is between 750 and 1500. In some embodiments, the maximum multiplier is between 900 and 1100. In some embodiments, the maximum multiplier is between 1000 and 1050.
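
By way of non-limiting illustration only, the search for a minimum passing multiplier subject to a maximum value can be sketched as follows. The callable passes_filter is a hypothetical stand-in for the Dynamic Variant Filtering decision at a given multiplier (i.e., whether the input sample alternate allele count exceeds the background alternate allele count), and the integer step size and default cap of 1000 are assumptions made for this sketch.

```python
def minimum_passing_multiplier(passes_filter, max_multiplier=1000):
    """Return the smallest integer multiplier that passes, or None.

    passes_filter: callable taking a multiplier and returning True when the
        resistance mutation passes the filter at that multiplier.
    max_multiplier: cap intended to keep excessive artifacts from passing.
    """
    for multiplier in range(1, max_multiplier + 1):
        if passes_filter(multiplier):
            return multiplier
    # No multiplier at or below the cap passes the variant.
    return None
```

A linear scan is used for clarity; because a larger multiplier can only relax the filter, a binary search over the same range would be an equivalent design choice.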

In some embodiments, the usage of the pre-test-odds-multiplier is limited by cancer-type context and genomic region (e.g., exon-window). In some embodiments, therefore, the multipliers will not be applied to all genomic regions (e.g., exon-windows) given a specified cancer-type, nor all cancer-types given a specific genomic region (e.g., exon-window).
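One way to picture this restriction is a lookup keyed on the (cancer type, genomic region) pair, with a cap on the multiplier. The sketch below is illustrative only: the table contents, the region label, and the cap of 1000 (chosen from within the 750 to 1500 range mentioned above) are assumptions.

```python
from typing import Dict, Tuple

# Assumed cap; the text gives ranges such as 750-1500 rather than one value.
MAX_MULTIPLIER = 1000.0

# Hypothetical table: multipliers exist only for specific pairs of
# cancer type and genomic region (e.g., an exon-window label).
MULTIPLIERS: Dict[Tuple[str, str], float] = {
    ("breast", "hypothetical-exon-window-1"): 5000.0,
}


def adjusted_pre_test_odds(
    pre_test_odds: float,
    cancer_type: str,
    region: str,
    table: Dict[Tuple[str, str], float] = MULTIPLIERS,
) -> float:
    """Scale the pre-test odds only when the pair has a registered multiplier."""
    multiplier = table.get((cancer_type, region), 1.0)
    return pre_test_odds * min(multiplier, MAX_MULTIPLIER)
```

Pairs that are absent from the table fall back to a multiplier of 1, so the historical pre-test odds are used unchanged for every other cancer type and region.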

Calculating testing sample variant allele count. In some embodiments, the filtering method (the statistical method used for the Dynamic Variant Filtering method) is selected from a beta-binomial distribution model, a beta distribution model, and a Poisson distribution model. In some embodiments, the model is a beta-binomial model. In some embodiments, when applying a quantile beta-binomial distribution, the sum of the input sample alternate reads is divided by the input sample sequencing depth at each variant position, and then multiplied by the reference pool depth (the sequencing depth at genomic positions for a pool of reference, e.g., healthy normal, controls). For additional information on beta-binomial models see, for example, Tripathi R. et al., “Estimation of Parameters in the Beta Binomial Model,” Ann. Inst. Statist. Math, 46(2):317-31 (1994), the disclosure of which is incorporated herein by reference in its entirety.
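The normalization described above can be sketched in a few lines of Python. The function name is hypothetical; the arithmetic follows the sentence above (alternate reads divided by sample depth, then multiplied by reference pool depth).

```python
def normalized_alt_count(
    alt_reads: int, sample_depth: int, reference_pool_depth: float
) -> float:
    """Rescale a sample's alternate read count to the reference pool depth."""
    if sample_depth <= 0:
        raise ValueError("sample_depth must be positive")
    return alt_reads / sample_depth * reference_pool_depth


# Example: 5 alternate reads at 2500x, rescaled to a 5000x reference pool.
scaled = normalized_alt_count(5, 2500, 5000)
```

Rescaling puts the sample count on the same depth scale as the background model, so the two counts can be compared directly.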

Calculating background variant allele count. In some embodiments, the background variant allele count calculation takes into account the background error from a pool of reference (e.g., healthy normal) subjects, the input sample error, and the prevalence of historical variants in the reference cancer subjects. The quantile beta-binomial model considers (i) the reference pool depth (the sequencing depth at genomic positions for a pool of reference, e.g., healthy normal, controls), (ii) the background posterior error average from the input sample, and (iii) an alpha calculated from the pre-test-odds, sensitivity, and the post-test-odds (e.g., where alpha is equal to 1−specificity=pre-test-odds*sensitivity/post-test-odds). For example, in some embodiments, the pre-test-odds calculated for a specific genomic region (e.g., exon window) and cancer-type will yield a unique alpha for each variant, given that the variants do not fall in the same genomic region (e.g., exon window).

In some embodiments, the background posterior error incorporates a reaction-specific sequencing error rate (e.g., a trinucleotide error average), a locus-specific, process-matched sequencing error rate (e.g., the reference pool error, which is a sum of alternate reads for each position/depth from a pool of healthy normal controls), and a shrinkage weight parameter. In some embodiments, the reaction-specific sequencing error rate (e.g., trinucleotide error average) is an aggregate of the input sample background average, where the input sample background average equals the error counts for each position divided by the position-specific sequencing depth. In some embodiments, the sample background average is then aggregated for each trinucleotide context. In some embodiments, the trinucleotide average is used to calculate the shrinkage weight parameter. In some embodiments, the shrinkage weight parameter equals the trinucleotide error average divided by the sum of the trinucleotide error average and the reference pool error. In instances when the shrinkage weight parameter is undefined, it is set to 1. In some embodiments, the background posterior error is calculated as:

background posterior error = shrinkage weight parameter * reaction-specific sequencing error rate (e.g., a trinucleotide error average) + (1 − shrinkage weight parameter) * healthy subject error.

In some embodiments, a reference pool error can be used in place of an input sample background average, for calculating the background posterior average error rate.
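The shrinkage calculation above can be sketched as follows. This is a minimal illustration that assumes the "healthy subject error" in the formula is the reference pool error defined earlier, and that the weight is undefined only when both error rates are zero; the function names are hypothetical.

```python
def shrinkage_weight(trinucleotide_error: float, reference_pool_error: float) -> float:
    """Weight = trinucleotide error / (trinucleotide error + reference pool error)."""
    denominator = trinucleotide_error + reference_pool_error
    if denominator == 0:
        return 1.0  # undefined (0/0) weight is set to 1, as stated in the text
    return trinucleotide_error / denominator


def background_posterior_error(
    trinucleotide_error: float, reference_pool_error: float
) -> float:
    """Blend the reaction-specific and locus-specific error rates."""
    weight = shrinkage_weight(trinucleotide_error, reference_pool_error)
    return weight * trinucleotide_error + (1.0 - weight) * reference_pool_error


# Example: trinucleotide error 0.001, reference pool error 0.003.
# weight = 0.25, so the blend is 0.25 * 0.001 + 0.75 * 0.003 = 0.0025.
error = background_posterior_error(0.001, 0.003)
```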

In some embodiments, the alpha for the beta-binomial distribution is calculated using the pre-test-odds, sensitivity, and post-test-odds, where:

alpha=1−specificity=pre-test-odds*sensitivity/post-test-odds
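This relation is consistent with the odds form of Bayes' theorem for a positive test result, in which post-test odds = pre-test odds × sensitivity / (1 − specificity); solving for 1 − specificity gives the expression above. A small Python sketch, with a hypothetical function name and illustrative input values:

```python
def alpha_from_odds(
    pre_test_odds: float, sensitivity: float, post_test_odds: float
) -> float:
    """alpha = 1 - specificity = pre-test odds * sensitivity / post-test odds."""
    if post_test_odds <= 0:
        raise ValueError("post_test_odds must be positive")
    return pre_test_odds * sensitivity / post_test_odds


# Illustrative values only: pre-test odds 0.01, sensitivity 0.9, and a target
# post-test odds of 99 (i.e., a 99% post-test probability).
alpha = alpha_from_odds(0.01, 0.9, 99.0)
```

A lower pre-test odds (a rarer variant region) yields a smaller alpha for the same target confidence, which in turn demands more supporting fragments.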

Accordingly, in some embodiments, the background posterior average, the reference pool depth, and the alpha are used in calculating the input to the quantile beta-binomial function. The alpha is used in calculating the mean value of the beta-binomial distribution, which equals 1−alpha/2. The size of the quantile beta-binomial is the matrix of the reference pool depth. The shape 1 parameter for the quantile beta-binomial function is the reference pool depth multiplied by the background posterior average error rate, and the shape 2 parameter of the quantile beta-binomial function is the shape 1 parameter subtracted from reference pool depth.

The output from the quantile BetaBinomial function is the minimum value a variant needs to be called. Any variant that has a normalized allele count below the quantile(BetaBinomial) output will be filtered due to the high background error observed at that position.
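The threshold calculation can be sketched in pure Python as shown below. This is one possible reading, not the disclosed implementation: it assumes the value 1−alpha/2 is the probability passed to the quantile function, that the reference pool depth is an integer, and that the background error is strictly between 0 and 1. The function names are hypothetical.

```python
import math


def _log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_quantile(p: float, n: int, a: float, b: float) -> int:
    """Smallest k in 0..n such that the beta-binomial CDF at k is >= p."""
    log_b = _log_beta(a, b)
    cdf = 0.0
    for k in range(n + 1):
        log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        cdf += math.exp(log_choose + _log_beta(k + a, n - k + b) - log_b)
        if cdf >= p:
            return k
    return n


def dynamic_threshold(reference_pool_depth: int, background_error: float, alpha: float) -> int:
    """Minimum normalized alternate allele count needed to call a variant."""
    if not 0.0 < background_error < 1.0:
        raise ValueError("background_error must be strictly between 0 and 1")
    shape1 = reference_pool_depth * background_error  # depth * error rate
    shape2 = reference_pool_depth - shape1            # depth - shape 1
    return betabinom_quantile(1.0 - alpha / 2.0, reference_pool_depth, shape1, shape2)


def passes_dynamic_filter(normalized_alt_count: float, threshold: int) -> bool:
    """Counts below the threshold are filtered; counts at or above it pass."""
    return normalized_alt_count >= threshold
```

Because the cumulative probability is non-decreasing in the count, a smaller alpha (lower pre-test odds) can only raise the threshold or leave it unchanged.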

For example, FIG. 4F illustrates a flow chart of a method 400 for validating a somatic sequence variant in a test subject having a cancer condition, in accordance with some embodiments of the present disclosure.

In some embodiments, the method includes obtaining (402) cell-free DNA sequencing data 122 from a sequencing reaction of a liquid biopsy sample of a test subject 121 (e.g., sequence reads 123-1-1-1, . . . , 123-1-1-K for sequence run 122-1-1 for a liquid biopsy sample from patient 121-1, as illustrated in FIG. 1B). As described herein, in some embodiments, the obtaining includes a step of sequencing cell-free nucleic acids from a liquid biopsy sample. Example methods for sequencing cell-free nucleic acids are described herein.

Sequence reads 123 from the sequencing data 122 are then aligned (404) to a human reference sequence (e.g., a human genome or a portion of a human genome, e.g., 1%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 75%, 90%, 95%, 99%, or more of the human genome, or to a map of a human reference genome or a set of human reference genomes, or a portion thereof), thereby generating a plurality of aligned reads 124. Optionally, the pre-aligned sequence reads 123 and/or aligned sequence reads 124 are pre-processed (408) using any of the methods disclosed above (e.g., normalization, bias correction, etc.). In some embodiments, as described herein, device 100 obtains previously aligned sequence reads.

The aligned sequence reads 124 are then evaluated to identify mismatches with the reference construct (e.g., reference genome or set of reference genomes), thereby identifying one or more candidate somatic sequence variants 132-c at respective genomic loci. The number of aligned sequence reads containing the sequence variant at the locus is determined, thereby defining a variant allele fragment count 132-c-ac (e.g., variant allele fragment count 132-c-1-ac as illustrated in FIG. 1C). In some embodiments, the number of aligned sequence reads containing the locus of the candidate variant allele (regardless of the identity of the allele represented in the sequence read) is also determined, thereby defining a variant allele locus count 132-c-lc (e.g., variant allele locus count 132-c-1-lc as illustrated in FIG. 1C). Accordingly, in some embodiments, the variant allele fragment count 132-c-ac can be compared to the variant allele locus count 132-c-lc to determine a variant allele fraction 132-c-vf (e.g., variant allele fraction 132-c-1-vf as illustrated in FIG. 1C) for the candidate variant allele. This represents a measure of the portion of sequence reads encompassing the nucleotide(s) altered in the candidate variant allele that include the candidate variant. In some embodiments, as described below, this measure can be used to define a sensitivity for the detection of the candidate variant based on a distribution of detection sensitivities corresponding to detection of a variant within a genomic region encompassing the locus in reference samples with defined variant allele fractions.
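The relationship between the two counts and the variant allele fraction can be expressed as a simple ratio. The Python sketch below uses a hypothetical function name and illustrative numbers.

```python
def variant_allele_fraction(variant_fragment_count: int, locus_fragment_count: int) -> float:
    """Fraction of fragments covering the locus that carry the candidate variant."""
    if locus_fragment_count <= 0:
        raise ValueError("locus_fragment_count must be positive")
    if not 0 <= variant_fragment_count <= locus_fragment_count:
        raise ValueError("variant count must lie between 0 and the locus count")
    return variant_fragment_count / locus_fragment_count


# Example: 12 variant-supporting fragments out of 4000 covering the locus
# gives a VAF of 0.003, i.e., 0.3%.
vaf = variant_allele_fraction(12, 4000)
```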

Method 400 then includes obtaining (412) a dynamic variant count threshold 191 for the candidate variant allele. As described herein, in some embodiments, the dynamic variant count threshold is based upon a prevalence of sequence variations in a genomic region encompassing the locus of the candidate variant allele in cancer patients sharing one or more similarities with the test subject. For example, in some embodiments, this prevalence defines a pre-test odds that the test subject has a sequence variant within the genomic region encompassing the locus at which the candidate sequence variant is located. In some embodiments, this pre-test odds is used in an application of Bayes theorem to derive a minimal amount of support required of the sequencing reaction to validate the presence of the candidate sequence variant in a cancerous tissue of the subject at a desired confidence level. Information about Bayes theorem and Bayesian inference can be found, for instance, in Section 8.7 of Stuart, A. and Ord, K. (1994), Kendall's Advanced Theory of Statistics: Volume I—Distribution Theory, Edward Arnold; and Gelman, A. et al., (2013), Bayesian Data Analysis, Third Edition, Chapman and Hall/CRC, ISBN 978-1-4398-4095-5, the disclosure of both of which are incorporated herein by reference for their teachings of how to implement Bayes theorem and Bayesian inference.

In some embodiments, the prevalence of sequence variants in the genomic region encompassing the locus of the candidate variant allele is determined from a population of reference cancer subjects having the same type of cancer. In some embodiments, the population of reference cancer subjects is further defined by a matching personal characteristic, e.g., an age, gender, race, smoking status, or any other personal characteristic. In some embodiments, the population of reference subjects is further defined by a plurality of matching personal characteristics, e.g., at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or more personal characteristics, in addition to cancer type.

For instance, in some embodiments, the prevalence of sequence variants is determined from variant prevalence training data 192, as illustrated in FIG. 1F. The variant prevalence training data 192 includes data on the variants found in a cancerous tissue from a plurality of reference subjects 193. For example, training data 192 for reference subject 1 193-1 includes a cancer type 194-1 and a list of somatic sequence variants 195-1, including individual variants 196-1-1 . . . 196-1-S. To determine a prevalence for a particular candidate sequence variant detected for a test subject, a genomic region encompassing the locus of the candidate sequence variant is defined (e.g., the exon of a gene in which a candidate sequence variant is detected). Then, it is determined what portion of reference subjects 193, that have the same cancer as the test subject, have a sequence variant located within the defined genomic region (e.g., the exon of the gene).
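The prevalence determination described above can be sketched as a filter over reference subject records. The record layout and function names below are hypothetical, and the conversion from prevalence to odds (p / (1 − p)) is the standard definition of odds, assumed here rather than quoted from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReferenceSubject:
    cancer_type: str
    # Each somatic variant is recorded as (chromosome, position).
    variants: List[Tuple[str, int]] = field(default_factory=list)


def region_prevalence(
    subjects: List[ReferenceSubject], cancer_type: str, chrom: str, start: int, end: int
) -> float:
    """Fraction of same-cancer reference subjects with a variant in the region."""
    matched = [s for s in subjects if s.cancer_type == cancer_type]
    if not matched:
        raise ValueError("no reference subjects with the requested cancer type")
    hits = sum(
        any(c == chrom and start <= pos <= end for c, pos in s.variants)
        for s in matched
    )
    return hits / len(matched)


def prevalence_to_odds(prevalence: float) -> float:
    """Convert a probability to odds; undefined when prevalence equals 1."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must be in [0, 1)")
    return prevalence / (1.0 - prevalence)
```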

In some embodiments, e.g., when only a limited set of defined candidate variants will be validated, sequence variant prevalence is predetermined and stored in a database, e.g., in non-persistent memory 111, or in an addressable remote server, as a look-up table. In other embodiments, system 100 determines a sequence variant prevalence for a genomic region and matching patient profile upon identification of a candidate sequence variant, e.g., by filtering variant prevalence training data 192 for the relevant genomic region and matching reference subjects.

Generally, the genomic region encompassing the candidate sequence variant is larger than a single nucleotide. For example, in some embodiments, the genomic region includes at least 10 nucleotides, at least 50 nucleotides, at least 100 nucleotides, at least 250 nucleotides, at least 500 nucleotides, at least 1000 nucleotides, at least 2500 nucleotides, or more nucleotides. In some embodiments, the genomic region is no larger than 10,000 nucleotides, no larger than 7500 nucleotides, no larger than 5000 nucleotides, no larger than 2500 nucleotides, or fewer nucleotides. In some embodiments, the genomic region is from 10 nucleotides to 10,000 nucleotides. In some embodiments, the genomic region is from 25 nucleotides to 5000 nucleotides. In some embodiments, the genomic region is from 50 nucleotides to 2500 nucleotides.

In some embodiments, when the candidate sequence variant falls within a protein coding sequence, the genomic region is defined as the exon in which the candidate sequence variant is located. In some embodiments, the genomic region is defined as several adjacent exons, including the exon in which the candidate sequence variant is located. In some embodiments, when the candidate sequence variant falls within a protein coding sequence, the genomic region is defined as all exons of the gene in which the candidate sequence variant is located. In some embodiments, when the candidate sequence variant falls within a protein coding sequence, the genomic region is defined as the entire gene in which the candidate sequence variant is located. Similarly, in some embodiments, when the candidate sequence variant falls within an intronic sequence of a gene, the genomic region is defined as the entire intron in which the candidate sequence variant is located, or several adjacent introns including the intron in which the candidate sequence variant is located.

In some embodiments, the genomic region encompassing the candidate sequence variant is a fixed window encompassing, e.g., surrounding, the candidate sequence variant. For example, in some embodiments, when the candidate sequence variant falls within a non-coding portion of the genome, the genomic region is defined as a fixed window surrounding the candidate sequence variant. However, in some embodiments, when the sequence variant falls within a non-coding genetic element, e.g., a promoter, enhancer, etc., the genomic region is defined as the entirety of the genetic element.

In some implementations, the genomic region encompassing the candidate sequence variant is dependent upon the sequence context of the locus. For example, when the candidate sequence variant falls within a coding sequence, the exon or several adjacent exons defines the genomic region, but when the candidate sequence variant falls within a non-coding sequence, the genomic region is defined by a fixed window encompassing the candidate sequence variant.
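A context-dependent region definition of this kind might look like the following Python sketch. The exon coordinates, the 250-nucleotide default window, and the function name are assumptions for illustration; only the single-exon case is shown.

```python
from typing import List, Tuple


def genomic_region(
    position: int, exons: List[Tuple[int, int]], window: int = 250
) -> Tuple[int, int]:
    """Return (start, end) of the region used for the prevalence lookup.

    `exons` holds inclusive (start, end) coordinates on the same chromosome.
    """
    for start, end in exons:
        if start <= position <= end:
            return (start, end)  # coding context: use the containing exon
    # Non-coding context: fixed window centered on the candidate variant,
    # clamped so the start is never below position 1.
    half = window // 2
    return (max(1, position - half), position + half)
```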

In some embodiments, the genomic region encompassing the candidate sequence variant is dependent upon a known or inferred effect of the sequence variant. For instance, as described in more detail below, in some embodiments, when the candidate sequence variant causes, or is inferred to cause, a partial or complete loss of function mutation in a gene, the genomic region is defined by all exons of the gene in which the candidate sequence variant is located. Similarly, as described in more detail below, in some embodiments, when the candidate sequence variant causes, or is inferred to cause, a gain of function mutation in a gene having one or more hotspots for gain of function mutations, the genomic region is defined as those exons of the gene encompassing the one or more hotspots.

In some embodiments, when the candidate sequence variant falls within a genomic region associated with a known therapeutic resistance gene for the cancer of the subject, the pre-test odds determined based on the historical prevalence data is multiplied by a pre-test-odds multiplier (e.g., as described above).

In some embodiments, the Bayesian analysis is further informed by defining the sensitivity of variant detection based on an apparent variant allele fraction in the sample. For example, in some embodiments, the variant allele fraction for the candidate sequence variant is determined by a comparison of the variant allele fragment count 132-c-ac to the variant allele locus count 132-c-lc (e.g., a ratio of the variant allele fragment count to the variant allele locus count), thereby determining a variant allele fraction 132-c-vf. In some embodiments, the variant allele fraction is then compared to a distribution of variant detection sensitivities established based on a set of training samples (e.g., sensitivity distribution training data) with known variant allele fractions. For example, in some embodiments, nucleic acids from each of a plurality of training samples 181 having a known variant allele fraction 184 for one or more variant alleles 183 are sequenced according to a process-matched sequencing reaction (e.g., using a substantially identical or identical sequencing reaction), and it is determined whether each sequence variant can be detected, e.g., defining a detection status 185 for each locus/variant 183. Over a large number of training samples, a sensitivity of detection of variants having different variant allele fractions can be determined. In some embodiments, the sensitivity is determined on a locus-by-locus basis, such that the sensitivity of detection is specific for the genomic region or locus encompassing the candidate sequence variant. In some embodiments, the sensitivity is determined globally, e.g., not on a locus-by-locus basis.

A correlation can then be established between the measured detection sensitivity and the variant allele fraction (e.g., variant detection sensitivity distribution 186). In some embodiments, the correlation is a linear or non-linear fit between measured detection sensitivities and variant allele fractions. In other embodiments, the correlation is determined by binning sensitivities (e.g., in bins 187) as a function of ranges of variant allele fractions 188, and determining a measure of central tendency (e.g., a mean) for the sensitivities 189 in the bin. The variant allele fraction 132-c-vf determined for the candidate sequence variant is then compared to the established correlation (e.g., variant detection sensitivity distribution 186) to define the sensitivity of detection for the candidate sequence variant.
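The binning approach behind a variant detection sensitivity distribution such as 186 can be sketched as follows: detection outcomes from training samples are grouped by known VAF, a mean is taken per bin, and the candidate's VAF is then looked up. The bin edges, the half-open bin convention, and the function names are assumptions for illustration.

```python
from bisect import bisect_right
from statistics import mean
from typing import List, Optional, Sequence, Tuple


def build_bin_means(
    observations: Sequence[Tuple[float, bool]], edges: Sequence[float]
) -> List[Optional[float]]:
    """Mean detection rate per VAF bin; bins are [edges[i], edges[i+1])."""
    bins: List[List[float]] = [[] for _ in range(len(edges) - 1)]
    for known_vaf, detected in observations:
        index = bisect_right(edges, known_vaf) - 1
        if 0 <= index < len(bins):
            bins[index].append(1.0 if detected else 0.0)
    return [mean(b) if b else None for b in bins]


def detection_rate_for_vaf(
    vaf: float, edges: Sequence[float], bin_means: Sequence[Optional[float]]
) -> Optional[float]:
    """Look up the binned detection rate for a candidate's VAF."""
    index = bisect_right(edges, vaf) - 1
    if not 0 <= index < len(bin_means):
        return None  # VAF falls outside the binned range
    return bin_means[index]
```

A linear or non-linear fit, as also mentioned above, would replace the bin lookup with an evaluation of the fitted curve at the candidate's VAF.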

In some embodiments, the Bayesian analysis is further informed by accounting for the sequencing error rate for the variant allele and, accordingly, the probability that the candidate sequence variant is a product of a sequencing error, rather than a genomic variant. In some embodiments, a reaction-specific error rate (e.g., a trinucleotide sequencing error rate) is determined for the sequencing reaction (e.g., using an internal control spiked into the reaction). In some embodiments, a locus-specific error rate is determined from historical sequencing errors at the genomic region, or specific locus, encompassing the candidate sequence variant. In some embodiments, both a reaction-specific sequencing error rate and a locus-specific error rate are used to define a variant count distribution (e.g., variant count distribution 190), representing the number of variant allele counts (e.g., variant allele fragment count 132-c-ac) necessary to validate the presence of the candidate variant sequence in the cancer of the subject at a defined detection sensitivity. In some embodiments, a distribution, such as a beta binomial distribution, is established based on the reaction-specific sequencing error rate and the locus-specific error rate.

Method 400 then includes applying (414) the dynamic variant count threshold (e.g., locus-specific dynamic variant count threshold 191) to the sequencing data, e.g., by determining whether the variant allele fragment count 132-c-ac for the candidate sequence variant satisfies the threshold, and validating the candidate sequence variant (e.g., creating a record 132-v of the validation) when the threshold is satisfied or rejecting the candidate sequence variant when the threshold is not satisfied. In some embodiments, one or more additional filters, relating to global sequencing metrics and/or locus-specific sequencing metrics (e.g., one or more of variant locus coverage filter(s) 463, variant allele fraction filter(s) 465, variant support mapping filter(s) 467, variant support sequencing quality filter(s) 469, and low complexity region filter(s) 471, as illustrated in FIG. 1D) must be satisfied before validating a candidate sequence variant.
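The validation step can be pictured as a threshold comparison followed by any additional filters. In the sketch below, "satisfies the threshold" is assumed to mean a count at or above the threshold, and the extra filters are modeled as hypothetical zero-argument callables returning a boolean.

```python
from typing import Callable, Iterable


def validate_candidate(
    variant_fragment_count: float,
    dynamic_threshold: float,
    extra_filters: Iterable[Callable[[], bool]] = (),
) -> str:
    """Return "validated" or "rejected" for a candidate somatic variant."""
    if variant_fragment_count < dynamic_threshold:
        return "rejected"  # insufficient support relative to the dynamic threshold
    if not all(check() for check in extra_filters):
        return "rejected"  # at least one additional filter was not satisfied
    return "validated"


# Example with a hypothetical locus coverage filter requiring 1000x depth.
locus_depth = 3200
status = validate_candidate(14, 12, [lambda: locus_depth >= 1000])
```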

As described in further detail herein, in some embodiments, one or more validated variant statuses 132-v are used to match (424) the subject with a targeted therapy and/or a clinical trial. In some embodiments, as described in further detail herein, one or more validated variant statuses 132-v for one or more actionable variants 139-1-1, one or more matched therapies 139-1-2, and/or one or more matched clinical trials are used to generate (426) a patient report 139-1-3. In some embodiments, the patient report is transmitted to a medical professional treating the subject. In some embodiments, the patient is then administered (428) a personalized course of therapy, e.g., based on a matched therapy and/or clinical trial.

In some embodiments, the methods of validating a candidate somatic sequence variant using a dynamic threshold described herein fall within the context of a larger variant detection method, e.g., method 450 as illustrated in FIGS. 4G1-4G3. For example, in some embodiments, the method includes obtaining (452) cfDNA sequence reads, as described herein, and aligning (454) those reads to a reference construct (e.g., a reference genome or mapped representation of several reference genomes), to generate aligned sequences 124 (e.g., a plurality of unique sequence reads). In some embodiments, putative somatic sequence variants are identified (456), e.g., those sequence variants having a variant allele fraction that is lower than expected for a germline sequence variant (which should be around 50% after accounting for an estimated circulating tumor fraction for the liquid biopsy sample), e.g., less than 30%, less than 20%, less than 10%, etc. One or more candidate somatic sequence variants are then validated by applying one or more filters. For instance, as described herein, a dynamic variant count threshold is determined (459) and then used to apply (460) a dynamic probabilistic variant count filter to sequencing data for the candidate somatic sequence variant. In some embodiments, the method also includes applying (462) a variant loci coverage filter. In some embodiments, the method also includes applying (464) a variant allele fraction filter. In some embodiments, the method also includes applying (466) a variant support mapping filter. In some embodiments, the method also includes applying (468) a variant support sequencing quality filter. In some embodiments, the method also includes applying (470) a low complexity region filter. When all selected candidate somatic sequence variants have been validated or rejected according to these filters (472), the process proceeds with a reporting function.

In some embodiments, method 450 also includes validating (474) the sequencing data globally, using any of the metrics described herein. In some embodiments, the validation includes applying (476) a loci minimal coverage filter. In some embodiments, the validation includes applying (478) a loci central tendency coverage filter. In some embodiments, the validation includes applying (480) a total sequence read filter. In some embodiments, the validation includes applying (481) a sequence read quality filter. In some embodiments, the validation includes applying a sequencing control filter (482). The entire sequencing reaction is then validated or rejected (483) based on whether the sequencing data passes these global filters.

In some embodiments, method 450 also includes validating (485) one or more germline mutations. In some embodiments, candidate germline sequence variants are identified (484), e.g., those sequence variants having a variant allele fraction that is higher than expected for a somatic sequence variant. In some embodiments, the validation includes applying (486) a germline-specific variant allele fraction filter. In some embodiments, the validation includes applying (487) a variant support mapping filter. In some embodiments, the validation includes applying (488) a variant support sequencing quality filter. When all selected candidate germline sequence variants have been validated or rejected according to these filters (489), the process proceeds with a reporting function.

As described in further detail herein, in some embodiments, one or more validated variant statuses 132-v are used to match (490) the subject with a targeted therapy and/or a clinical trial. In some embodiments, as described in further detail herein, one or more validated variant statuses 132-v for one or more actionable variants 139-1-1, one or more matched therapies 139-1-2, and/or one or more matched clinical trials are used to generate (492) a patient report 139-1-3. In some embodiments, the patient report is transmitted to a medical professional treating the subject. In some embodiments, the patient is then administered (494) a personalized course of therapy, e.g., based on a matched therapy and/or clinical trial.

In some embodiments, all, or nearly all, of the aligned sequence reads are evaluated to identify candidate sequence variants (e.g., candidate somatic sequence variants and/or candidate germline sequence variants). In other embodiments, a subset of the aligned sequence reads is evaluated to identify candidate sequence variants. For example, in one embodiment, a targeted-panel sequencing reaction is used to generate sequencing data 122 and only sequence reads corresponding to the target panel (on-target reads) are evaluated to identify candidate sequence variants. In some embodiments, a targeted-panel sequencing reaction is used to generate sequencing data 122 and a subset of sequence reads corresponding to a subset of the target panel is evaluated to identify candidate sequence variants. In some embodiments, a subset of the sequence reads corresponding to a subset of genes, regardless of whether the sequencing reaction is a targeted-panel sequencing reaction, a whole exome sequencing reaction, or a whole genome sequencing reaction, is evaluated to identify candidate sequence variants. In some embodiments, a subset of sequence reads corresponding to a defined set of regions within the genome, e.g., one or more genes, one or more introns, one or more exons, one or more subregions of an intron and/or exon associated with cancer etiology, etc., is evaluated to identify candidate sequence variants.

Alternatively, in some embodiments, regardless of what subset of aligned sequence reads is evaluated to identify candidate sequence variants, only a subset of candidate sequence variants is further validated. For example, in some embodiments, only candidate sequence variants corresponding to the target panel (on-target reads) are validated. Similarly, in some embodiments, only candidate sequence variants corresponding to a subset of the target panel are validated. Likewise, in some embodiments, only candidate sequence variants corresponding to a subset of genes, regardless of whether the sequencing reaction is a targeted-panel sequencing reaction, a whole exome sequencing reaction, or a whole genome sequencing reaction, are validated. Similarly, in some embodiments, only candidate variants corresponding to a defined set of regions within the genome, e.g., one or more genes, one or more introns, one or more exons, one or more subregions of an intron and/or exon associated with cancer etiology, etc., are validated.

In some embodiments, different sets of sequence variants are evaluated depending on the type of cancer being evaluated. That is, when the subject has a first type of cancer, candidate sequence variants in a first set of genomic loci are evaluated, typically associated with the etiology of the first type of cancer and/or a particular course of actionable therapy for the first type of cancer, and when the subject has a second type of cancer, candidate sequence variants in a second set of genomic loci are evaluated, typically associated with the etiology of the second type of cancer and/or a particular course of actionable therapy for the second type of cancer. These selections may be applied at the level of initial sequence read evaluation (e.g., only sequence reads corresponding to a defined set of loci are evaluated to identify a candidate sequence variant) or the validation level (e.g., sequence reads corresponding to a larger set of loci are evaluated to identify candidate sequence variants, but only those candidates corresponding to a defined set are further validated).

Similarly, in some embodiments, for one or more target loci falling within a gene exon, only candidate sequence variants that would result in an amino acid change in the amino acid sequence encoded by the gene are evaluated. In some embodiments, any candidate sequence variant resulting in an amino acid change is evaluated. In some embodiments, candidate sequence variants resulting in a defined amino acid change, e.g., an amino acid change associated with cancer etiology and/or a particular actionable cancer therapy, are evaluated. In some embodiments, only a subset of validated sequence variants is included on a clinical report for the sample. That is, in some embodiments, aligned sequence reads corresponding to all or a subset of genomic loci are evaluated to identify candidate sequence variants, all or a subset of identified candidate sequence variants are evaluated for validation, and only a subset of all possibly validated sequence variants is included on a clinical report generated for the sample.

For example, lists of example candidate sequence variants for evaluation in breast cancer, non-small cell lung cancer, prostate cancer, pan cancer, and cancer of unknown origin are provided below. Standard nomenclature is used to describe chromosomal location and specific amino acid variants, as described further by the Human Genome Variation Society, e.g., at the URL varnomen.hgvs.org/recommendations/protein/variant/substitution/.

For example, in some embodiments, the subject has breast cancer and candidate variants associated with at least one of the following genes and/or genetic loci are evaluated: ERBB2 (or a genetic locus including a chromosomal position of 17:37880220 and/or 17:37881064), EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55242511, and/or 7:55249022), ESR1 (or a genetic locus including a chromosomal position of 6:152419922, 6:152419923 and/or 6:152419926), KRAS (or a genetic locus including a chromosomal position of 12:25380275, 12:25380276, 12:25380277, and/or 12:25380279), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), MTOR (or a genetic locus including a chromosomal position of 1:11187094, 1:11187096, and/or 1:11187796), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044 and/or 1:156849144), and PIK3CA (or a genetic locus including a chromosomal position of 3:178936082, 3:178936091, 3:178936092, 3:178936093, 3:178952084, and/or 3:178952085). In some embodiments, the subject has breast cancer and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, or at least 8 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has breast cancer and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.

In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, L755W, T798I, T798K, and T798R.

In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, D761H, D761N, D761Y, V774L, and V774M.

In some of the embodiments described above where the subject has breastcancer, only a subset of possible candidate sequence variants in theESR1 gene are evaluated and/or reported. In some embodiments, the subsetof possible candidate sequence variants in the ESR1 gene includesvariants resulting in an amino acid change selected from Y537D, Y537H,Y537N, Y537C, Y537S, Y537F, D538A, D538G, and D538V.

In some of the embodiments described above where the subject has breastcancer, only a subset of possible candidate sequence variants in theKRAS gene are evaluated and/or reported. In some embodiments, the subsetof possible candidate sequence variants in the KRAS gene includesvariants resulting in an amino acid change selected from G60D, Q61H,Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, and Q61K.

In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.

In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.

In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the MTOR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MTOR gene includes variants resulting in an amino acid change selected from A2034E, A2034G, A2034V, F2108F, F2108I, F2108L, and F2108V.

In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.

In some of the embodiments described above where the subject has breast cancer, only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from E542K, E545*, E545K, E545Q, E545A, E545G, E545V, E545D, E545E, H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.
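By way of a non-limiting illustration, the per-gene subsets enumerated above for breast cancer could be held in a simple lookup structure and consulted before a candidate variant is evaluated or reported. The following Python sketch is only one possible arrangement. The dictionary name, its layout, and the function `is_reportable` are hypothetical and are not taken from the disclosure, and only three of the genes listed above are shown.

```python
# Hypothetical configuration: gene -> amino acid changes that are evaluated
# and/or reported for one cancer type. The entries are copied from the breast
# cancer lists above; the layout and all names are illustrative assumptions.
BREAST_CANCER_REPORTABLE = {
    "ERBB2": {"L755*", "L755S", "L755W", "T798I", "T798K", "T798R"},
    "ESR1": {"Y537D", "Y537H", "Y537N", "Y537C", "Y537S", "Y537F",
             "D538A", "D538G", "D538V"},
    "MAP2K1": {"P124A", "P124S", "P124T", "P124R", "P124L", "P124Q"},
}


def is_reportable(gene, amino_acid_change, panel=BREAST_CANCER_REPORTABLE):
    """Return True if the change is in the configured subset for the gene."""
    # Genes absent from the panel yield an empty set, so nothing is reported.
    return amino_acid_change in panel.get(gene, set())
```

A separate dictionary of the same shape could be kept for each cancer condition, so that the same candidate variant is handled differently depending on the subject's diagnosis.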

Similarly, in some embodiments, the subject has non-small cell lung cancer and candidate variants associated with at least one of the following genes and/or genetic loci are evaluated: ALK (or a genetic locus including a chromosomal position of 2:29443613, 2:29443631, 2:29443695, 2:29443697, 2:29445213, and/or 2:29445258), B2M (or a genetic locus including a chromosomal position of 15:45003745), BRAF (or a genetic locus including a chromosomal position of 7:140453135, 7:140453136, and/or 7:140453137), EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55241704, 7:55241705, 7:55241706, 7:55242469, 7:55242511, 7:55249022, 7:55249071, 7:55249091, 7:55249092, 7:55249093, 7:55249094, and/or 7:55259515), ERBB2 (or a genetic locus including a chromosomal position of 17:37880220), KRAS (or a genetic locus including a chromosomal position of 12:25378562, 12:25378643, 12:25380275, 12:25380276, 12:25380277, 12:25380279, 12:25398255, 12:25398280, 12:25398281, 12:25398282, 12:25398283, 12:25398284, and/or 12:25398285), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044, and/or 1:156849144), PIK3CA (or a genetic locus including a chromosomal position of 3:178936091, 3:178936092, 3:178936093, 3:178952072, 3:178952084, and/or 3:178952085), and STK11 (or a genetic locus including a chromosomal position of 19:1218483, 19:1220370, 19:1220487, 19:1220629, and/or 19:1220649). In some embodiments, the subject has non-small cell lung cancer and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, or at least 10 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has non-small cell lung cancer and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the ALK gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ALK gene includes variants resulting in an amino acid change selected from G1202*, G1202R, L1196L, L1196M, L1196V, F1174F, F1174L, F1174I, F1174V, I1171N, I1171S, I1171T, C1156F, C1156S, and C1156Y.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the BRAF gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the BRAF gene includes variants resulting in an amino acid change selected from V600*, V600A, V600E, V600G, V600L, and V600M.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, L718L, L718M, L718V, L718P, L718Q, L718R, L747I, L747L, L747V, D761H, D761N, D761Y, V774L, V774M, T790K, T790M, T790R, C797G, C797R, C797S, C797F, C797Y, C797*, C797C, C797W, L798F, L798I, L798V, L858P, L858Q, and L858R.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, and L755W.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the KRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the KRAS gene includes variants resulting in an amino acid change selected from A146T, D119N, Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, Q61K, G60V, Q22K, G13G, G13A, G13V, G13D, G13C, G13R, G13S, G12G, G12A, G12V, G12D, G12C, G12R, and G12S.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from E545*, E545K, E545Q, E545A, E545G, E545V, E545D, E545E, M1043V, H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.

In some of the embodiments described above where the subject has non-small cell lung cancer, only a subset of possible candidate sequence variants in the STK11 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the STK11 gene includes variants resulting in an amino acid change selected from E120*, D194Y, S216F, and E223*, as well as nucleotide substitution c.465-2A>T.

Similarly, in some embodiments, the subject has prostate cancer and candidate variants associated with at least one of the following genes and/or genetic loci are evaluated: AR (or a genetic locus including a chromosomal position of X:66766292, X:66931463, X:66931504, X:66937370, X:66937371, X:66937372, X:66943543, X:66943549, and/or X:66943552), EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55242511, and/or 7:55249022), ERBB2 (or a genetic locus including a chromosomal position of 17:37880220), KRAS (or a genetic locus including a chromosomal position of 12:25380275, 12:25380276, and/or 12:25380277), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044, and/or 1:156849144), and PIK3CA (or a genetic locus including a chromosomal position of 3:178952084 and/or 3:178952085). In some embodiments, the subject has prostate cancer and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, or at least 7 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has prostate cancer and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.

In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the AR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the AR gene includes variants resulting in an amino acid change selected from W435L, L702H, L702P, L702R, V716M, W742G, W742R, W742*, W742L, W742S, W742C, H875Y, F877L, T878A, T878P, and T878S.

In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, D761H, D761N, D761Y, V774L, and V774M.

In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, and L755W.

In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the KRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the KRAS gene includes variants resulting in an amino acid change selected from Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, and Q61K.

In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.

In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.

In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.

In some of the embodiments described above where the subject has prostate cancer, only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.

In one example, the cancer condition is any type of cancer (for example, pan-cancer) and the somatic variants validated by this method include variants associated with any of the following genes: EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55242511, and/or 7:55249022), ERBB2 (or a genetic locus including a chromosomal position of 17:37880220), KRAS (or a genetic locus including a chromosomal position of 12:25380275, 12:25380276, and/or 12:25380277), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044, and/or 1:156849144), PIK3CA (or a genetic locus including a chromosomal position of 3:178952084 and/or 3:178952085), and TP53. In some embodiments, the subject has any cancer (e.g., pan-cancer) and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, or at least 7 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has any cancer (e.g., pan-cancer) and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.

In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, D761H, D761N, D761Y, V774L, and V774M.

In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, and L755W.

In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the KRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the KRAS gene includes variants resulting in an amino acid change selected from Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, and Q61K.

In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.

In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.

In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.

In some of the embodiments described above where the subject has any type of cancer (e.g., pan-cancer), only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.

Similarly, in some embodiments, the subject has a tumor of unknown origin or a cancer of unknown primary and candidate variants associated with at least one of the following genes and/or genetic loci are evaluated: EGFR (or a genetic locus including a chromosomal position of 7:55227926, 7:55242511, and/or 7:55249022), ERBB2 (or a genetic locus including a chromosomal position of 17:37880220), KRAS (or a genetic locus including a chromosomal position of 12:25380275, 12:25380276, 12:25380277, and/or 12:25398255), MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic locus including a chromosomal position of 7:116422117 and/or 7:116423413), NRAS (or a genetic locus including a chromosomal position of 1:115258748), NTRK1 (or a genetic locus including a chromosomal position of 1:156846342, 1:156849044, and/or 1:156849144), PIK3CA (or a genetic locus including a chromosomal position of 3:178927980, 3:178952084, and/or 3:178952085), and TP53. In some embodiments, the subject has any cancer (e.g., pan-cancer) and candidate variants associated with at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, or at least 8 of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated. In some embodiments, the subject has any cancer (e.g., pan-cancer) and candidate variants associated with any of the genes listed above (or loci including the enumerated corresponding chromosomal positions) are evaluated.

In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the EGFR gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the EGFR gene includes variants resulting in an amino acid change selected from G465*, G465R, D761H, D761N, D761Y, V774L, and V774M.

In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the ERBB2 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the ERBB2 gene includes variants resulting in an amino acid change selected from L755*, L755S, and L755W.

In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the KRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the KRAS gene includes variants resulting in an amino acid change selected from Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, Q61K, and Q22K.

In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MAP2K1 gene includes variants resulting in an amino acid change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.

In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the MET gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the MET gene includes variants resulting in an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.

In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the NRAS gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NRAS gene includes variants resulting in an amino acid change of G12S.

In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the NTRK1 gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the NTRK1 gene includes variants resulting in an amino acid change selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.

In some of the embodiments described above where the subject has a tumor of unknown origin or a cancer of unknown primary, only a subset of possible candidate sequence variants in the PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of possible candidate sequence variants in the PIK3CA gene includes variants resulting in an amino acid change selected from C420R, H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.

In other embodiments, the cancer condition is acute myeloid leukemia, adrenal cancer, B cell lymphoma, basal cell carcinoma, biliary cancer, bladder cancer, brain cancer, breast cancer, cervical cancer, chromophobe renal cell carcinoma, clear cell renal cell carcinoma, colorectal cancer, confirm at path review (cancer type unconfirmed), endocrine tumor, endometrial cancer, esophageal cancer, gastric cancer, gastrointestinal stromal tumor, glioblastoma, head and neck cancer, head and neck squamous cell carcinoma, heme other, high-grade glioma, kidney cancer, liver cancer, low-grade glioma, medulloblastoma, melanoma, meningioma, mesothelioma, multiple myeloma, neuroblastoma, non-clear cell renal cell carcinoma, non-small cell lung cancer, oropharyngeal cancer, ovarian cancer, pan-cancer, pancreatic cancer, peritoneal cancer, prostate cancer, sarcoma, skin cancer, small cell lung cancer, T cell lymphoma, testicular cancer, thymoma, thyroid cancer, tumor of unknown origin, or uveal melanoma.

In some embodiments, certain variants pre-identified on a whitelist may be rescued, e.g., not filtered out, when they fail to pass selective filters, e.g., MSI/SN, a Bayesian filtering method, and/or a coverage, VAF, or region-based filter. The rationale for whitelisting a variant is to apply less stringent filtering criteria to such a variant so that it can be reviewed and/or reported. In some embodiments, one or more variants on the whitelist are common pathogenic variants, e.g., with high clinical relevance. In this fashion, when a variant on the whitelist fails to pass certain filters, it will be rescued and not filtered out. As used herein, MSI/SN refers to a variant filter for filtering out potential artifactual variants based on the MSI (microsatellite instable) and SN (signal-to-noise ratio) values calculated by the variant caller VarDict. See, for example, VarDict documentation, available on the internet at github.com/AstraZeneca-NGS/VarDictJava.
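As a non-limiting sketch of the whitelist rescue described above, the following Python fragment keeps a variant that either passed every filter or is whitelisted and failed only filters designated as rescuable. The class `Variant`, the filter labels, the whitelist entries, and the function `keep_variant` are hypothetical names chosen for illustration. The disclosure does not specify which variants are whitelisted or how filter failures are recorded.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    gene: str
    protein_change: str
    # Labels of the filters this variant failed (empty set = passed all).
    failed_filters: set = field(default_factory=set)


# Hypothetical whitelist entries; real entries would be curated separately.
WHITELIST = {("EGFR", "T790M"), ("KRAS", "G12C")}

# Assumed labels for the "selective" filters that a whitelisted variant
# is allowed to fail while still being rescued.
RESCUABLE_FILTERS = {"MSI/SN", "bayesian", "coverage", "vaf", "region"}


def keep_variant(variant):
    """Return True if the variant should be retained for review/reporting."""
    if not variant.failed_filters:
        return True  # passed every filter; no rescue needed
    on_whitelist = (variant.gene, variant.protein_change) in WHITELIST
    # Rescue only when every failed filter is one of the rescuable ones.
    return on_whitelist and variant.failed_filters <= RESCUABLE_FILTERS
```

In this arrangement, a whitelisted variant that fails a filter outside the rescuable set is still removed, which is one way of keeping the rescue limited to the selective filters.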

In some embodiments, one or more loci and/or genomic regions are blacklisted, preventing somatic variant annotation for variants identified at the locus or region. In some embodiments, the variant has a length of 120, 100, 80, 60, 40, 20, 10, 5, or fewer base pairs. In various embodiments, any combination of the additional criteria, as well as additional criteria not listed above, may be applied to the variant calling process. Again, in some embodiments, different criteria are applied to the annotation of different types of variants.

In some embodiments, liquid biopsy assays are used to detect variant alterations present at low circulating fractions in the patient's blood. In such circumstances, it may be warranted to lower the requirements for positively identifying a variant. That is, in some embodiments, low levels of support may be sufficient to call a variant, dependent upon the reason for using the liquid biopsy assay.

In some embodiments, SNV/INDEL detection is accomplished using VarDict (github.com/AstraZeneca-NGS/VarDictJava). Both SNVs and INDELs are called and then sorted, deduplicated, normalized, and annotated. The annotation uses SnpEff to add transcript information, 1000 Genomes minor allele frequencies, COSMIC reference names and counts, ExAC allele frequencies, and Kaviar population allele frequencies. The annotated variants are then classified as germline, somatic, or uncertain using a Bayesian model based on prior expectations informed by databases of germline and cancer variants. In some embodiments, uncertain variants are treated as somatic for filtering and reporting purposes.
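The Bayesian model itself is not detailed here. Purely as an illustrative sketch, and not as the model actually used, the three-way classification could be arranged as follows. The sketch assumes a binomial likelihood for the variant-supporting read count, a germline heterozygous allele fraction of 0.5, a hypothetical expected somatic allele fraction, and a prior probability of germline origin that would in practice be derived from the germline and cancer variant databases mentioned above. All function names, parameters, and thresholds are assumptions.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def classify_origin(alt_count, depth, prior_germline,
                    somatic_vaf=0.05, lower=0.1, upper=0.9):
    """Label a variant 'germline', 'somatic', or 'uncertain'.

    prior_germline must lie strictly between 0 and 1. somatic_vaf, lower,
    and upper are illustrative defaults, not values from the disclosure.
    """
    log_germ = log(prior_germline) + log_binom_pmf(alt_count, depth, 0.5)
    log_som = log(1 - prior_germline) + log_binom_pmf(alt_count, depth,
                                                     somatic_vaf)
    diff = log_som - log_germ
    # Posterior probability of germline origin, guarded against overflow.
    if diff > 700:
        posterior = 0.0
    elif diff < -700:
        posterior = 1.0
    else:
        posterior = 1.0 / (1.0 + exp(diff))
    if posterior >= upper:
        return "germline"
    if posterior <= lower:
        return "somatic"
    return "uncertain"
```

Working in log space keeps the calculation stable at the high read depths typical of liquid biopsy sequencing, where direct evaluation of the binomial coefficients would overflow.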

In some embodiments, genomic rearrangements (e.g., inversions, translocations, and gene fusions) are detected following de-multiplexing by aligning tumor FASTQ files against a human reference genome using a local alignment algorithm, such as BWA. In some embodiments, DNA reads are sorted, and duplicates may be marked with software, for example, SAMBlaster. Discordant and split reads may be further identified and separated. These data may be read into software, for example, LUMPY, for structural variant detection. In some embodiments, structural alterations are grouped by type, recurrence, and presence and stored within a database and displayed through a fusion viewer software tool. The fusion viewer software tool may reference a database, for example, Ensembl, to determine the gene and proximal exons surrounding the breakpoint for any possible transcript generated across the breakpoint. The fusion viewer tool may then place the breakpoint 5′ or 3′ to the subsequent exon in the direction of transcription. For inversions, this orientation may be reversed for the inverted gene. After positioning of the breakpoint, the translated amino acid sequences may be generated for both genes in the chimeric protein, and a plot may be generated containing the remaining functional domains for each protein, as returned from a database, for example, UniProt.

For instance, in an example implementation, gene rearrangements are detected using the SpeedSeq analysis pipeline. Chiang et al., 2015, “SpeedSeq: ultra-fast personal genome analysis and interpretation,” Nat Methods, (12), pg. 966. Briefly, FASTQ files are aligned to hg19 using BWA. Split reads mapped to multiple positions and read pairs mapped to discordant positions are identified and separated, then utilized to detect gene rearrangements by LUMPY. Layer et al., 2014, “LUMPY: a probabilistic framework for structural variant discovery,” Genome Biol, (15), pg. 84. Fusions can then be filtered according to the number of supporting reads.

In some embodiments, putative fusion variants supported by fewer than a minimum number of unique sequence reads are filtered. In some embodiments, the minimum number of unique sequence reads is 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, or 20 unique sequence reads.
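The read-support filter described above can be sketched, in a non-limiting way, as a single pass over the putative fusion calls. The record layout (a dictionary with a `unique_reads` key) and the function name `filter_fusions` are hypothetical. The default threshold of 3 is simply one of the values enumerated above.

```python
def filter_fusions(fusions, min_unique_reads=3):
    """Keep putative fusions supported by at least min_unique_reads reads.

    Each fusion is assumed to be a dict carrying a 'unique_reads' count of
    de-duplicated supporting sequence reads (an illustrative layout).
    """
    return [f for f in fusions if f["unique_reads"] >= min_unique_reads]


# Example input with hypothetical gene pairs and read counts.
calls = [
    {"genes": ("EML4", "ALK"), "unique_reads": 7},
    {"genes": ("GENE_A", "GENE_B"), "unique_reads": 2},
]
kept = filter_fusions(calls)
```

With the default threshold, only the first call in the example is retained, because the second is supported by fewer than three unique reads.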

Allelic Fraction Determination

In some embodiments, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes determination of variant allele fractions 133 for one or more of the variant alleles 132 identified as described above. In some embodiments, a variant allele fraction module 151 tallies the instances that each allele is represented by a unique sequence read encompassing the variant locus of interest, generating a count for each allele represented at that locus. In some embodiments, these tallies are used to determine the ratio of the variant allele, e.g., an allele other than the most prevalent allele in the subject's population for a respective locus, to a reference allele. This variant allele fraction 133 can be used in several places in the feature extraction 206 workflow. For instance, in some embodiments, a variant allele fraction is used during annotations of identified variants, e.g., when determining whether the allele originated from a germline cell or a somatic cell. In other instances, a variant allele fraction is used in a process for estimating a tumor fraction for a liquid biopsy sample or a tumor purity for a solid tumor fraction. For instance, variant allele fractions for a plurality of somatic alleles can be used to estimate the percentage of sequence reads originating from one copy of a cancerous chromosome. Assuming a 100% tumor purity and that each cancer cell carries one copy of the variant allele, the overall purity of the tumor can be estimated. This estimate, of course, can be further corrected based on other information extracted from the sequencing data, such as copy number alterations, tumor ploidy aberrations, tumor heterozygosity, etc.
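The tallying described above can be illustrated with the following minimal Python sketch. It assumes that the base observed at the locus has already been extracted from each unique sequence read, and it reports each allele as a fraction of all reads covering the locus, which is one common convention. The helper `estimate_tumor_fraction` further assumes a diploid locus at which each cancer cell carries a single copy of the variant allele, so that the tumor fraction is roughly twice the somatic allele fraction. Both function names and the use of the median are assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def allele_fractions(allele_calls):
    """Map each allele observed at a locus to its fraction of unique reads.

    allele_calls holds one entry per unique read covering the locus.
    """
    counts = Counter(allele_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


def estimate_tumor_fraction(somatic_vafs):
    """Rough tumor fraction from somatic variant allele fractions.

    Assumes diploid loci with one variant copy per cancer cell, so the
    estimate is twice the median fraction, capped at 1.0.
    """
    if not somatic_vafs:
        return 0.0
    return min(1.0, 2 * median(somatic_vafs))
```

As noted above, such an estimate would ordinarily be corrected for copy number alterations and ploidy, which this sketch deliberately ignores.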

Methylation Determination

In some embodiments, where a nucleic acid sequencing library was processed by bisulfite treatment or enzymatic methyl-cytosine conversion, as described above, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes determination of methylation states 132 for one or more loci in the genome of the patient. In some embodiments, methylation sequencing data is aligned to a reference sequence construct 158 in a different fashion than non-methylation sequencing, because non-methylated cytosines are converted to uracils, and the resulting uracils are ultimately sequenced as thymines, whereas methylated cytosines are not converted and are sequenced as cytosines. Different approaches, therefore, have to be used to align these modified sequences to a reference sequence construct, such as seeding alignments with shorter regions of identity or converting all cytosines to thymines in the sequencing data and then aligning the data to reference sequence constructs for both the plus and minus strand of the sequence construct. For review of these approaches, see Zhou Q. et al., BMC Bioinformatics, 20(47):1-11 (2019), the content of which is hereby incorporated by reference, in its entirety, for all purposes. Algorithms for calling methylated bases are known in the art. For example, Bismark is able to distinguish between cytosines in CpG, CHG, and CHH contexts. Krueger F. and Andrews S R, Bioinformatics, 27(11):1571-72 (2011), the content of which is hereby incorporated by reference, in its entirety, for all purposes.
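The in silico conversion strategy mentioned above can be illustrated with a toy Python sketch. Exact string matching stands in for a real aligner, only the plus strand is handled, and the function names are hypothetical. The read and the reference are both converted C to T so that the read can be placed regardless of methylation, and the original read bases are then compared with the original reference to call each cytosine.

```python
def c_to_t(seq):
    """Fully convert a sequence in silico (every C becomes T)."""
    return seq.upper().replace("C", "T")


def place_read(read, reference):
    """Offset of the read on the plus strand of the reference, or -1.

    Exact matching of converted sequences is a stand-in for alignment.
    """
    return c_to_t(reference).find(c_to_t(read))


def call_methylation(read, reference):
    """Return {reference position: state} for cytosines under the read."""
    offset = place_read(read, reference)
    if offset < 0:
        return {}
    calls = {}
    window = reference.upper()[offset:offset + len(read)]
    for i, (ref_base, read_base) in enumerate(zip(window, read.upper())):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[offset + i] = "methylated"      # protected from conversion
        elif read_base == "T":
            calls[offset + i] = "unmethylated"    # converted, read as T
    return calls
```

For the minus strand, the analogous step would apply a G to A conversion to the reference, which is why alignments are performed against converted constructs for both strands.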

Copy Number Variation Analysis:

In some embodiments, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes determination of the copy number 135 for one or more loci, using a copy number variation analysis module 153. In some embodiments, where both a liquid biopsy sample and a normal tissue sample of the patient are analyzed, de-duplicated BAM files and a VCF generated from the variant calling pipeline are used to compute read depth and variation in heterozygous germline SNVs between sequencing reads for each sample. By contrast, in some embodiments, where only a liquid biopsy sample is being analyzed, comparison between a tumor sample and a pool of process-matched normal controls is used. In some embodiments, copy number analysis includes application of a circular binary segmentation algorithm and selection of segments with highly differential log₂ ratios between the cancer sample and its comparator (e.g., a matched normal or normal pool). In some embodiments, approximate integer copy number is assessed from a combination of differential coverage in segmented regions and an estimate of stromal admixture (for example, tumor purity, or the portion of a sample that is cancerous vs. non-cancerous, such as a tumor fraction for a liquid biopsy sample) that is generated by analysis of heterozygous germline SNVs.

For instance, in an example implementation, copy number variants (CNVs) are analyzed using the CNVkit package. Talevich et al., PLoS Comput Biol, 12:1004873 (2016), the content of which is hereby incorporated by reference, in its entirety, for all purposes. CNVkit is used for genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation and visualization. The log₂ ratios between the tumor sample and a pool of process matched healthy samples from the CNVkit output are then annotated and filtered using statistical models whereby the amplification status (amplified or not amplified) of each gene is predicted and non-focal amplifications are removed.

In some embodiments, copy number variations (CNVs) are analyzed using a combination of an open-source tool, such as CNVkit, and an annotation/filtering algorithm, e.g., implemented via a python script. CNVkit is used initially to perform genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation and, optionally, visualization. The bin-level copy ratios and segment-level copy ratios, in addition to their corresponding confidence intervals, from the CNVkit output are then used in the annotation and filtering, where the copy number state (amplified, neutral, deleted) of each segment and bin is determined and non-focal amplifications/deletions are filtered out based on a set of acceptance criteria. In some embodiments, one or more copy number variations selected from amplifications in the MET, EGFR, ERBB2, CD274, CCNE1, and MYC genes, and deletions in the BRCA1 and BRCA2 genes are analyzed. However, the methods described herein are not limited to only these reportable genes.

In some embodiments, CNV analysis is performed using a tumor BAM file, a target region BED file, a pool of process matched normal samples, and inputs for initial reference pool construction. Inputs for initial reference pool construction include one or more of normal BAM files, a human reference genome file, mappable regions of the genome, and a blacklist that contains recurrent problematic areas of the genome.

CNVkit utilizes both targeted captured sequencing reads and non-specifically captured off-target reads to infer copy number information. The targeted genomic regions specified in the probe target BED file are divided into target bins with an average size of, e.g., 100 base pairs, which can be specified by the user. The genomic regions between the target regions, e.g., excluding regions that cannot be mapped reliably, are automatically divided into off-target (also referred to as anti-target) bins with an average size of, e.g., 150 kbp, which again can be specified by the user. Raw log₂-transformed depths are then calculated from the alignments in the input BAM file and written to two tab-delimited .cnn files, one for each of the target and off-target bins.

A pooled reference is constructed from a panel of process matched normal samples. The raw log₂ depths of target and off-target bins in each normal sample are computed as described above, and then each is median-centered and corrected for bias including GC content, genome sequence repetitiveness, target size, and/or spacing. The corrected target and off-target log₂ depths are combined, and a weighted average and spread are calculated as Tukey's biweight location and midvariance in each bin. These values are written to a tab-delimited reference .cnn file, which is used to normalize an input tumor sample as follows.

The raw log₂ depths of an input sample are median-centered and bias-corrected as described for the reference construction. The corresponding log₂ depth in the reference file is then subtracted from the corrected log₂ depth of each bin, resulting in the log₂ copy ratios (also referred to as copy ratios or log₂ ratios) between the input tumor sample and the reference pool. These values are written to a tab-delimited .cnr file.
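As a minimal sketch of the normalization step just described, and not of CNVkit's own code, the per-bin subtraction might look as follows. The function names and example depths are hypothetical, and the bias-correction step (GC content, repetitiveness, target size, spacing) is deliberately omitted.

```python
import statistics

def median_center(log2_depths):
    """Shift a sample's log2 depths so that their median is zero."""
    med = statistics.median(log2_depths)
    return [d - med for d in log2_depths]

def log2_copy_ratios(sample_log2_depths, reference_log2_depths):
    """Per-bin log2 copy ratio: centered sample depth minus reference depth.
    Bias correction is assumed to have been applied elsewhere."""
    centered = median_center(sample_log2_depths)
    return [s - r for s, r in zip(centered, reference_log2_depths)]

# Hypothetical five-bin sample against a flat (all-zero) pooled reference.
ratios = log2_copy_ratios([10.0, 10.0, 11.0, 10.0, 9.0], [0.0] * 5)
```

In this toy input the third bin sits one log₂ unit above the sample median and the fifth bin one unit below, which is what the resulting ratios reflect.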

The copy ratios are then segmented, e.g., via a circular binary segmentation (CBS) algorithm or another suitable segmentation algorithm, whereby adjacent bins are grouped to larger genomic regions (segments) of equal copy number. The segment's copy ratio is calculated as the weighted mean of all bins within the segment. The confidence interval of the segment mean is estimated by bootstrapping the bin-level copy ratios within the segment. The segments' genomic ranges, copy ratios and confidence intervals are written to a tab-delimited .cns file.
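The segment-level summary described above (a weighted mean with a bootstrapped confidence interval) can be sketched as below. This is an illustrative percentile bootstrap under assumed defaults (1000 resamples, a 95% interval, a fixed seed); the function names, bin values, and weights are hypothetical and the sketch does not reproduce CNVkit's internal implementation.

```python
import random

def weighted_mean(values, weights):
    """Weighted mean of bin-level log2 copy ratios within one segment."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def bootstrap_ci(values, weights, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the segment's weighted mean.
    Bins are resampled with replacement, keeping each bin's weight."""
    rng = random.Random(seed)
    indices = range(len(values))
    means = []
    for _ in range(n_boot):
        draw = [rng.choice(indices) for _ in indices]
        means.append(weighted_mean([values[i] for i in draw],
                                   [weights[i] for i in draw]))
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical segment of five bins with per-bin weights.
bin_ratios = [0.55, 0.62, 0.58, 0.60, 0.65]
bin_weights = [1.0, 2.0, 1.0, 1.0, 1.0]
segment_mean = weighted_mean(bin_ratios, bin_weights)
ci_low, ci_high = bootstrap_ci(bin_ratios, bin_weights)
```

A narrow interval that excludes zero would support calling the segment as gained or lost, whereas an interval spanning zero would be consistent with a neutral copy state.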

Log₂ transformed copy ratios, log₂ copy ratios, log₂-transformed depths, log₂-transformed read depths, log₂ depths, corrected log₂ depths, log₂ ratios, log₂ read depths, and log₂ depth correction values have been discussed herein by way of example. In each instance where such a term is used, it will be appreciated that log base 2 is presented by way of example only and that the present disclosure is not so limited. Indeed, logarithms to any base N may be used (e.g., where N is a positive number greater than 1, for instance), and thus the present disclosure fully supports log_(N) transformed copy ratios, log_(N) copy ratios, log_(N)-transformed depths, log_(N)-transformed read depths, log_(N) depths, corrected log_(N) depths, log_(N) ratios, log_(N) read depths, and log_(N) depth correction values as respective substitutes for log₂ transformed copy ratios, log₂ copy ratios, log₂-transformed depths, log₂-transformed read depths, log₂ depths, corrected log₂ depths, log₂ ratios, log₂ read depths, and log₂ depth correction values.

Microsatellite Instability (MSI):

In some embodiments, analysis of aligned sequence reads, e.g., in SAM or BAM format, includes analysis of the microsatellite instability status 137 of a cancer, using a microsatellite instability analysis module 154. In some embodiments, an MSI classification algorithm classifies a cancer into one of three categories: microsatellite instability-high (MSI-H), microsatellite stable (MSS), or microsatellite equivocal (MSE). Microsatellite instability is a clinically actionable genomic indication for cancer immunotherapy. In microsatellite instability-high (MSI-H) tumors, defects in DNA mismatch repair (MMR) can cause a hypermutated phenotype where alterations accumulate in the repetitive microsatellite regions of DNA. MSI detection is conventionally performed by subjecting tumor tissue (“solid biopsy”) to clinical next-generation sequencing or specific assays, such as MMR IHC or MSI PCR.

For example, microsatellite instability status can be assessed by determining the number of repeating units present at a plurality of microsatellite loci, e.g., 5, 10, 15, 20, 25, 30, 40, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, or more loci. In some embodiments, only reads encompassing a microsatellite locus that include a significant number of flanking nucleotides on both ends, e.g., at least 5, 10, 15, or more nucleotides flanking each end, are used for the analysis in order to avoid using reads that do not completely cover the locus. In some embodiments, a minimal number of reads, e.g., at least 5, 10, 20, 30, 40, 50, or more reads, have to meet these criteria in order to use a particular microsatellite locus, in order to ensure the accuracy of the determination given the high incidence of polymerase slipping during replication of these repeated sequences.

In some embodiments, each locus is tested individually for instability, e.g., as measured by a change or variance in the number of nucleotide base repeats, e.g., in cancer-derived nucleotide sequences relative to a normal sample or standard, for example, using the Kolmogorov-Smirnov test. For example, if p≤0.05, the locus is considered unstable. The proportion of unstable microsatellite loci may be fed into a logistic regression classifier trained on samples from various cancer types, especially cancer types which have clinically determined MSI statuses, for example, colorectal and endometrial cohorts. For MSI testing where only a liquid biopsy sample is analyzed, the mean and variance for the number of repeats may be calculated for each microsatellite locus. A vector containing the mean and variance data may be put into a classifier (e.g., a support vector machine classification algorithm) trained to provide a probability that the patient is MSI-H, which may be compared to a threshold value. In some embodiments, the threshold value for calling the patient as MSI-H is at least 60% probability, or at least 65% probability, 70% probability, 75% probability, 80% probability, or greater. In some embodiments, a baseline threshold may be established to call the patient as MSS. In some embodiments, the baseline threshold is no more than 40%, or no more than 35% probability, 30% probability, 25% probability, 20% probability, or less. In some embodiments, when the output of the classifier falls within the range between the MSI-H and MSS thresholds, the patient is identified as MSE.
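The thresholding logic described above can be sketched as follows. The function names are hypothetical, the classifier that produces the MSI-H probability is assumed to exist elsewhere, and the default thresholds (60% and 40%) are simply the first example values given above.

```python
def unstable_fraction(p_values, alpha=0.05):
    """Proportion of microsatellite loci whose per-locus test gave p <= alpha."""
    if not p_values:
        return 0.0
    return sum(1 for p in p_values if p <= alpha) / len(p_values)

def classify_msi(p_msi_high, msi_h_threshold=0.60, mss_threshold=0.40):
    """Map a classifier's MSI-H probability onto MSI-H, MSS, or MSE.
    Probabilities strictly between the two thresholds are called MSE."""
    if p_msi_high >= msi_h_threshold:
        return "MSI-H"
    if p_msi_high <= mss_threshold:
        return "MSS"
    return "MSE"

# Hypothetical per-locus p-values and classifier outputs.
fraction = unstable_fraction([0.01, 0.20, 0.05, 0.60])
calls = [classify_msi(p) for p in (0.85, 0.50, 0.10)]
```

Keeping an equivocal band between the two thresholds avoids forcing a binary call on samples for which the classifier output is uninformative.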

Other methods for determining the MSI status of a subject are known in the art. For example, in some embodiments, microsatellite instability analysis module 154 employs an MSI evaluation method described in U.S. Provisional Patent Application Ser. No. 62/881,845, filed Aug. 1, 2019, or U.S. Provisional Application Ser. No. 62/931,600, filed Nov. 6, 2019, the contents of which are hereby incorporated by reference, in their entireties, for all purposes.

Tumor Mutational Burden (TMB):

In some embodiments, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes determination of a mutation burden for the cancer (e.g., a tumor mutational burden 136), using a tumor mutational burden analysis module 155. Generally, a tumor mutational burden is a measure of the mutations in a cancer per unit of the patient's genome. For example, a tumor mutational burden may be expressed as a measure of central tendency (e.g., an average) of the number of somatic variants per million base pairs in the genome. In some embodiments, a tumor mutational burden refers to only a set of possible mutations, e.g., one or more of SNVs, MNVs, indels, or genomic rearrangements. In some embodiments, a tumor mutational burden refers to only a subset of one or more types of possible mutations, e.g., non-synonymous mutations, meaning those mutations that alter the amino acid sequence of an encoded protein. In other embodiments, for example, a tumor mutational burden refers to the number of one or more types of mutations that occur in protein coding sequences, e.g., regardless of whether they change the amino acid sequence of the encoded protein.

As an example, in some embodiments, a tumor mutational burden (TMB) is calculated by dividing the number of mutations (e.g., all variants or non-synonymous variants) identified in the sequencing data (e.g., as represented in a VCF file) by the size (e.g., in megabases) of a capture probe panel used for targeted sequencing. In some embodiments, a variant is included in the tumor mutational burden calculation only when certain criteria are met. For instance, in some embodiments, a threshold sequence coverage for the locus associated with the variant must be met before the variant is included in the calculation, e.g., at least 25×, 50×, 75×, 100×, 250×, 500×, or greater. Similarly, in some embodiments, a minimum number of unique sequence reads encompassing the variant allele must be identified in the sequencing data, e.g., at least 4, 5, 6, 7, 8, 9, 10, or more unique sequence reads. In some embodiments, a variant allele fraction threshold must be satisfied before the variant is included in the calculation, e.g., at least 0.01%, 0.1%, 0.25%, 0.5%, 0.75%, 1%, 1.5%, 2%, 2.5%, 3%, 4%, 5%, or greater. In some embodiments, the inclusion criteria may be different for different types of variants and/or different variants of the same type. For instance, a variant detected in a mutation hotspot within the genome may face less rigorous criteria than a variant detected in a more stable locus within the genome.
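A minimal sketch of this calculation, under stated assumptions, is given below. The Variant record, the function name, and the default thresholds (100× coverage, 5 supporting reads, 0.5% variant allele fraction) are hypothetical choices drawn from the example ranges above, and a single set of criteria is applied to every variant for simplicity.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    depth: int        # unique reads covering the locus
    alt_reads: int    # unique reads supporting the variant allele

def tumor_mutational_burden(variants, panel_size_mb,
                            min_depth=100, min_alt_reads=5, min_vaf=0.005):
    """Count variants passing the inclusion criteria, per megabase of panel."""
    counted = 0
    for v in variants:
        if v.depth < min_depth or v.alt_reads < min_alt_reads:
            continue
        if v.alt_reads / v.depth < min_vaf:
            continue
        counted += 1
    return counted / panel_size_mb

# Hypothetical calls on a hypothetical 2.0 Mb panel.
example_variants = [
    Variant(depth=500, alt_reads=10),   # passes all criteria
    Variant(depth=80, alt_reads=6),     # fails the coverage threshold
    Variant(depth=1000, alt_reads=4),   # fails the supporting-read threshold
    Variant(depth=400, alt_reads=8),    # passes all criteria
]
tmb = tumor_mutational_burden(example_variants, panel_size_mb=2.0)
```

Locus-specific criteria, such as relaxed thresholds at mutation hotspots, could be accommodated by passing per-variant thresholds instead of global defaults.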

Other methods for calculating tumor mutation burden in liquid biopsy samples and/or solid tissue samples are known in the art. See, for example, Fenizia F. et al., Transl Lung Cancer Res., 7(6):668-77 (2018) and Georgiadis A et al., Clin. Cancer Res., 25(23):7024-34 (2019), the disclosures of which are hereby incorporated by reference, in their entireties, for all purposes.

Homologous Recombination Status (HRD):

In some embodiments, analysis of aligned sequence reads, e.g., in SAM or BAM format, includes analysis of whether the cancer is homologous recombination deficient (HRD status 137-3), using a homologous recombination pathway analysis module 157.

Homologous recombination (HR) is a normal, highly conserved DNA repair process that enables the exchange of genetic information between identical or closely related DNA molecules. It is most widely used by cells to accurately repair harmful breaks (e.g., damage) that occur on both strands of DNA. DNA damage may occur from exogenous (external) sources like UV light, radiation, or chemical damage; or from endogenous (internal) sources like errors in DNA replication or other cellular processes that create DNA damage. Double strand breaks are a type of DNA damage. Using poly (ADP-ribose) polymerase (PARP) inhibitors in patients with HRD compromises two pathways of DNA repair, resulting in cell death (apoptosis). The efficacy of PARP inhibitors is improved not only in ovarian cancers displaying germline or somatic BRCA mutations, but also in cancers in which HRD is caused by other underlying etiologies.

In some embodiments, HRD status can be determined by inputting features correlated with HRD status into a classifier trained to distinguish between cancers with homologous recombination pathway deficiencies and cancers without homologous recombination pathway deficiencies. For example, in some embodiments, the features include one or more of (i) a heterozygosity status for a first plurality of DNA damage repair genes in the genome of the cancerous tissue of the subject, (ii) a measure of the loss of heterozygosity across the genome of the cancerous tissue of the subject, (iii) a measure of variant alleles detected in a second plurality of DNA damage repair genes in the genome of the cancerous tissue of the subject, and (iv) a measure of variant alleles detected in the second plurality of DNA damage repair genes in the genome of the non-cancerous tissue of the subject. In some embodiments, all four of the features described above are used as features in an HRD classifier. More details about HRD classifiers using these and other features are described in U.S. patent application Ser. No. 16/789,363, filed Feb. 12, 2020, the content of which is hereby incorporated by reference, in its entirety, for all purposes.

Circulating Tumor Fraction:

In some embodiments, the analysis of aligned sequence reads, e.g., in SAM or BAM format, includes estimation of a circulating tumor fraction for the liquid biopsy sample. Tumor fraction or circulating tumor fraction is the fraction of cell-free nucleic acid molecules in the sample that originates from a cancerous tissue of the subject, rather than from a non-cancerous tissue (e.g., a germline or hematopoietic tissue). Several open source analysis packages have modules for calculating tumor fraction from solid tumor samples. For instance, PureCN (Riester, M., et al., Source Code Biol Med, 11:13 (2016)) is designed to estimate tumor purity from targeted short-read sequencing data of solid tumor samples. Similarly, FACETS (Shen R, Seshan V E, Nucleic Acids Res., 44(16):e131 (2016)) is designed to estimate tumor fraction from sequencing data of solid tumor samples. However, estimating tumor fraction from a liquid biopsy sample is more difficult because of the generally lower tumor fraction relative to a solid tumor sample and the typically small size of a targeted panel used for liquid biopsy sequencing. Indeed, packages such as PureCN and FACETS perform poorly at low tumor fractions and with sequencing data generated using small targeted panels.

In some embodiments, circulating tumor fraction is estimated from a targeted-panel sequencing reaction of a liquid biopsy sample using an off-target read methodology, e.g., as described herein with reference to FIG. 4F. Briefly, a circulating tumor fraction estimate is determined from reads in the target captured regions, as well as off-target reads uniformly distributed across the human reference genome. Segments having similar copy ratios, e.g., as assigned via circular binary segmentation (CBS) during CNV analysis, are fit to integer copy states, e.g., via an expectation-maximization algorithm using the sum of squared error of the segment log₂ ratios (normalized to genomic interval size) to expected ratios given a putative copy state and tumor fraction. For more information on expectation maximization algorithms see, for example, Sundberg, Rolf (1974). “Maximum likelihood theory for incomplete data from an exponential family”. Scandinavian Journal of Statistics. 1 (2): 49-58, the content of which is hereby incorporated by reference in its entirety. A measure of fit between corresponding segment-level coverage ratios and assigned integer copy states across the plurality of simulated circulating tumor fractions is then used to select the simulated circulating tumor fraction to be used as the circulating tumor fraction for the liquid biopsy sample. In some embodiments, error minimization is used to identify the simulated tumor fraction providing the best fit to the data.
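The error-minimization idea described above can be illustrated with a simplified sketch. Note that this sketch swaps in a plain grid search with nearest-integer copy-state assignment in place of an expectation-maximization fit, and it assumes a diploid normal background, a maximum of 8 copies, and a grid of simulated tumor fractions from 0.01 to 1.00; all function names and segment values are hypothetical.

```python
import math

def expected_log2_ratio(copy_state, tumor_fraction, normal_copies=2):
    """Expected log2 ratio of a segment when a fraction of the cfDNA carries
    `copy_state` copies and the remainder carries `normal_copies`."""
    mixed = tumor_fraction * copy_state + (1 - tumor_fraction) * normal_copies
    return math.log2(max(mixed, 1e-6) / normal_copies)  # floor avoids log2(0)

def fit_error(segments, tumor_fraction, max_copies=8):
    """Size-weighted sum of squared errors, assigning each segment the
    integer copy state that best explains its observed log2 ratio."""
    total_size = sum(size for _, size in segments)
    sse = 0.0
    for log2_ratio, size in segments:
        best = min((log2_ratio - expected_log2_ratio(c, tumor_fraction)) ** 2
                   for c in range(max_copies + 1))
        sse += best * size / total_size
    return sse

def estimate_tumor_fraction(segments, grid=None):
    """Return the simulated tumor fraction with the smallest fit error."""
    if grid is None:
        grid = [i / 100 for i in range(1, 101)]
    return min(grid, key=lambda f: fit_error(segments, f))

# Hypothetical segments simulated at a tumor fraction of 0.3:
# (observed log2 ratio, segment size in base pairs).
true_states_and_sizes = [(2, 5_000_000), (3, 2_000_000),
                         (1, 1_000_000), (0, 500_000)]
segments = [(expected_log2_ratio(c, 0.3), size)
            for c, size in true_states_and_sizes]
best_fraction = estimate_tumor_fraction(segments)
```

Because different combinations of tumor fraction and copy state can predict similar ratios, an error curve of this kind may have several local minima on real data.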

In some embodiments, a measure of fit between corresponding segment-level coverage ratios and assigned integer copy states across the plurality of simulated circulating tumor fractions (e.g., using an error minimization algorithm) provides a number of local optima (e.g., local minima for an error minimization model or local maxima for a fit maximization model) for the best fit between the segment-level coverage ratios and assigned integer copy states. In some such embodiments, a second estimate of circulating tumor fraction is used to select the local optimum (e.g., the local minimum in best agreement with the second estimate of circulating tumor fraction) to be used as the circulating tumor fraction for the liquid biopsy sample.

For example, in some embodiments, multiple local optima (e.g., minima) can be disambiguated based on a difference between somatic and germline variant allele fractions. The assumption is that the variant allele fraction (VAF) of germline variants that exhibit loss of heterozygosity (LOH) will increase or decrease by an amount approximately equal to half of the tumor purity (e.g., the circulating tumor fraction for a liquid biopsy sample). With a matched normal sample (e.g., where sequencing data for both a liquid biopsy sample and a non-cancerous sample from the subject is available, or where sequencing data for both a solid tumor sample and a non-cancerous sample from the subject is available), for a given heterozygous germline variant, the VAF delta can be calculated as delta=abs(VAF_(tumor)−VAF_(normal)). However, for tumor-only sequencing (e.g., where sequencing data is only available for a liquid biopsy sample or a solid tumor sample), the VAF_(normal) is unknown. In some embodiments, the VAF_(normal) is assumed to be 50%. To increase statistical power and account for the imprecision in the VAF determined by sequencing, the delta is calculated for all such variants, and the circulating tumor fraction estimate (ctFE) for this method is calculated as ctFE=max(2×delta) over all variant delta values. While this can be used as a method for ctFE alone, its precision is limited by the number of detected LOH variants. For a small panel, there are few expected LOH variants and thus the ctFE may not be precise on its own. However, it can be used to disambiguate multiple local optima (e.g., minima), especially for high tumor fraction values estimated by the off-target read methodology described herein. For that, the off-target read methodology ctFE peaks corresponding to all the local optima (e.g., minima) are identified and the one closest to the ctFE estimated by LOH delta is chosen as the most likely global optimum (e.g., minimum).
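The LOH-delta calculation and the disambiguation step described above might be sketched as follows. The function names and example values are hypothetical; the sketch assumes that the heterozygous germline variants exhibiting LOH have already been identified, and it uses the 50% normal VAF assumption when no matched normal is supplied.

```python
def ctfe_from_loh(tumor_vafs, normal_vafs=None, assumed_normal_vaf=0.5):
    """ctFE = max(2 x delta), where delta = abs(VAF_tumor - VAF_normal)
    for each heterozygous germline variant exhibiting LOH."""
    if normal_vafs is None:
        # Tumor-only sequencing: assume each germline variant sits at 50%.
        normal_vafs = [assumed_normal_vaf] * len(tumor_vafs)
    deltas = [abs(t - n) for t, n in zip(tumor_vafs, normal_vafs)]
    return max(2 * d for d in deltas)

def pick_local_optimum(local_optima, second_estimate):
    """Choose the off-target-method local optimum nearest a second estimate."""
    return min(local_optima, key=lambda f: abs(f - second_estimate))

# Hypothetical tumor-only data: three LOH germline variants and three
# candidate tumor fractions from the off-target read methodology.
loh_estimate = ctfe_from_loh([0.56, 0.43, 0.60])
chosen = pick_local_optimum([0.08, 0.22, 0.45], loh_estimate)
```

With these assumed inputs, the largest delta is about 0.10, giving an LOH-based estimate of about 0.20, so the candidate at 0.22 is selected.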

Several other methods may also be used to estimate circulating tumor fractions. In some embodiments, these methods are used in combination with the off-target tumor estimate method described herein. For example, in some embodiments, one or more of these methodologies is used to generate an estimate of tumor fraction, which is then used to identify the nearest local optimum (e.g., minimum) obtained from the tumor fraction estimation methods described above, and further herein.

For example, the ichorCNA package applies a probabilistic model to normalized read coverages from ultra-low pass whole genome sequencing data of cell-free DNA to estimate tumor fraction in the liquid biopsy sample. For more information, see, Adalsteinsson, V. A. et al., Nat Commun 8:1324 (2017), the content of which is disclosed herein for its description of a probabilistic tumor fraction estimation model in the “methods” section. Similarly, Tiancheng H. et al., describe a Maximum Likelihood model based on the copy number of an allele in the sample and variant allele frequency in paired-control samples. For more information, see, Tiancheng H. et al., Journal of Clinical Oncology 37:15 suppl, e13053-e13053 (2019), the content of which is disclosed herein for its description of a Maximum Likelihood tumor fraction estimation model.

In some embodiments, a statistic for somatic variant allele fractions determined for the liquid biopsy sample is used as an estimate for the circulating tumor fraction of the liquid biopsy sample. For example, in some embodiments, a measure of central tendency (e.g., a mean or median) for a plurality of variant allele fractions determined for the liquid biopsy sample is used as an estimate of circulating tumor fraction. In some embodiments, a lowest (minimum) variant allele fraction determined for the liquid biopsy sample is used as an estimate of circulating tumor fraction. In some embodiments, a highest (maximum) variant allele fraction determined for the liquid biopsy sample is used as an estimate of circulating tumor fraction. In some embodiments, a range defined by two or more of these statistics is used to limit the range of simulated tumor fraction analysis via the off-target read methodology described herein. For instance, in some embodiments, lower and upper bounds of the simulated tumor fraction analysis are defined by the minimum variant allele fraction and the maximum variant allele fraction determined for a liquid biopsy sample, respectively. In some embodiments, the range is further expanded, e.g., on either or both the lower and upper bounds. For example, in some embodiments, the lower bound of a simulated tumor fraction analysis is defined as 0.5-times the minimum variant allele fraction, 0.75-times the minimum variant allele fraction, 0.9-times the minimum variant allele fraction, 1.1-times the minimum variant allele fraction, 1.25-times the minimum variant allele fraction, 1.5-times the minimum variant allele fraction, or a similar multiple of the minimum variant allele fraction determined for the liquid biopsy sample. Similarly, in some embodiments, the upper bound of a simulated tumor fraction analysis is defined as 2.5-times the maximum variant allele fraction, 2-times the maximum variant allele fraction, 1.75-times the maximum variant allele fraction, 1.5-times the maximum variant allele fraction, 1.25-times the maximum variant allele fraction, 1.1-times the maximum variant allele fraction, 0.9-times the maximum variant allele fraction, or a similar multiple of the maximum variant allele fraction determined for the liquid biopsy sample.
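As a brief sketch of how such bounds might be derived, the following hypothetical helper scales the minimum and maximum somatic variant allele fractions by chosen multiples. The default multiples (0.5 and 1.5) are two of the example values listed above, and capping the upper bound at 1.0 is an added assumption, since a fraction cannot exceed 100%.

```python
def simulation_bounds(somatic_vafs, lower_multiple=0.5, upper_multiple=1.5):
    """Lower and upper bounds for the simulated tumor fraction range,
    derived from the minimum and maximum somatic variant allele fractions."""
    lower = lower_multiple * min(somatic_vafs)
    upper = min(1.0, upper_multiple * max(somatic_vafs))  # cap is an assumption
    return lower, upper

# Hypothetical somatic VAFs observed in one liquid biopsy sample.
lower_bound, upper_bound = simulation_bounds([0.02, 0.10, 0.30])
```

Restricting the simulated range in this way reduces the number of candidate tumor fractions that need to be evaluated and can exclude implausible local optima.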

In some embodiments, circulating tumor fraction is estimated based on a distribution of the lengths of cfDNA in the liquid biopsy sample. In some embodiments, sequence reads are binned according to their position within the genome, e.g., as described elsewhere herein. For each bin, the length of each fragment is determined. Each fragment is then classified as belonging to one of a plurality of classes, e.g., one of two classes corresponding to a population of short fragments and a population of long fragments. In some embodiments, the classification is performed using a static length threshold, e.g., that is the same across all the bins. In some embodiments, the classification is performed using a dynamic length threshold. In some embodiments, a dynamic length threshold is determined by comparing the distribution of fragment lengths in liquid biopsy samples from reference subjects that do not have cancer to the distribution of fragment lengths in liquid biopsy samples from reference subjects that have cancer, in a positional fashion.

For example, in some embodiments, the comparison is done over windows spanning entire chromosomes, e.g., each chromosome defines a comparison window over which a dynamic length threshold is determined. In some embodiments, the comparison is done over a window spanning a single bin, e.g., each bin defines a comparison window over which a dynamic length threshold is determined. In certain embodiments, the bin determination may be made according to various genomic features. For example, the comparison window may be based on a chromosome by chromosome basis, or a chromosomal arm by chromosomal arm basis. In some embodiments, the comparison window is based on a gene level basis. In some embodiments, the comparison window is a fixed size, such as 1 kB, 5 kB, 10 kB, 25 kB, 50 kB, 100 kB, 250 kB, 500 kB, 1 MB, 2 MB, 3 MB, or more. In some embodiments, the reference subjects having cancer used to determine the dynamic length threshold are matched to the cancer type of the subject whose liquid biopsy sample is being evaluated.

Once each fragment is classified as belonging to either the population of short fragments or the population of long fragments, a model trained to estimate circulating tumor fraction based on fragment length distribution data across the genome is applied to the binned data to generate an estimate of the circulating tumor fraction for the liquid biopsy sample. In some embodiments, a comparison of (i) the population of short fragments and (ii) the population of long fragments is made for each bin, e.g., a ratio of the number of short fragments to the number of long fragments in each bin is determined and used as an input for the model. In some embodiments, the model is a probabilistic model (e.g., an application of Bayes theorem), a deep learning model (e.g., a neural network, such as a convolutional neural network), or an admixture model.
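The per-bin comparison of short and long fragments can be sketched as below, using a static length threshold. The function name, the 100 kB bin size, and the 150 bp threshold are hypothetical illustration values rather than parameters taken from this disclosure, and the downstream model that consumes the ratios is not shown.

```python
from collections import defaultdict

def short_long_ratios(fragments, bin_size=100_000, length_threshold=150):
    """For each genomic bin, return (short fragments) / (long fragments).
    `fragments` holds (chromosome, start, length) tuples; bins with no long
    fragments map to None rather than dividing by zero."""
    counts = defaultdict(lambda: [0, 0])       # bin -> [short, long]
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        counts[key][0 if length < length_threshold else 1] += 1
    return {key: (short / long_ if long_ else None)
            for key, (short, long_) in counts.items()}

# Hypothetical cfDNA fragments.
example_fragments = [
    ("chr1", 1_000, 140), ("chr1", 2_000, 170),
    ("chr1", 3_000, 166), ("chr1", 150_000, 120),
]
ratios_by_bin = short_long_ratios(example_fragments)
```

A dynamic threshold could be supported by replacing the single length_threshold argument with a lookup keyed by bin or chromosome.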

In some embodiments, two or more of the circulating tumor fraction estimation models described herein are used to generate respective tumor fraction estimates, which are combined to form a final tumor fraction estimate. For example, in some embodiments, a measure of central tendency (e.g., a mean) for several tumor fraction estimates is determined and used as the final tumor fraction estimate. In some embodiments, a tumor fraction estimate derived from a plurality of estimation models, e.g., a measure of central tendency for several tumor fraction estimates, is used to identify the nearest local optimum (e.g., minimum) obtained from the tumor fraction estimation methods described above, and further herein.

Concurrent Testing

Unless stated otherwise, as used herein, the term “concurrent” as it relates to assays refers to a period of time between zero and ninety days. In some embodiments, concurrent tests using different biological samples from the same subject (e.g., two or more of a liquid biopsy sample; cancerous tissue, such as a solid tumor sample or a blood sample for a blood-based cancer; and a non-cancerous sample) are performed within a period of time (e.g., the biological samples are collected within the period of time) of from 0 days to 90 days. In other embodiments, the period of time is from 0 days to 60 days, from 0 days to 30 days, from 0 days to 21 days, from 0 days to 14 days, from 0 days to 7 days, or from 0 days to 3 days.

In some embodiments, a liquid biopsy assay may be used concurrently with a solid tumor assay to return more comprehensive information about a patient's variants. For example, a blood specimen and a solid tumor specimen may be sent to a laboratory for evaluation. The solid tumor specimen may be analyzed using a bioinformatics pipeline to produce a solid tumor result. A solid tumor assay is described, for instance, in U.S. patent application Ser. No. 16/657,804, the content of which is hereby incorporated by reference, in its entirety, for all purposes. The cancer type of the solid tumor may include, for example, non-small cell lung cancer, colorectal cancer, or breast cancer. Alterations identified in the tumor/matched normal result may include, for example, EGFR+ for non-small cell lung cancer; HER2+ for breast cancer; or KRAS G12C for several cancers.

In some embodiments, a blood specimen may be divided into a first portion and a second portion. The first portion of the blood specimen and the solid tumor specimen may be analyzed using a bioinformatics pipeline to produce a tumor/matched normal result. The second portion of the blood specimen may be analyzed using a bioinformatics pipeline to produce a liquid biopsy result. For example, the blood specimen may be analyzed using at least an improvement in somatic variant identification, e.g., as described herein in the section entitled “Systems and Methods for Improved Validation of Somatic Sequence Variants” and/or “Variant Identification.” For example, the blood specimen may be analyzed using an improvement in focal copy number identification, e.g., as described herein in the section entitled “Copy Number Variation.” For example, the blood specimen may be analyzed using an improvement in circulating tumor fraction determination, e.g., as described above in the section entitled “Circulating Tumor Fraction.”

Therapies may be identified for further consideration in response to receiving the tumor or tumor/matched normal result along with the liquid biopsy result. For example, if the results overall indicate that the patient has HER2+ breast cancer, neratinib may be identified along with the test results for further consideration by the ordering clinician.

The solid tumor or tumor/matched normal assay and the liquid biopsy assay may be ordered concurrently; their results may be delivered concurrently; and they may be analyzed concurrently.

Quality Control

In some embodiments, a positive sensitivity control sample is processed and sequenced along with one or more clinical samples. In some embodiments, the control sample is included in at least one flow cell of a multi-flow cell reaction and is processed and sequenced each time a set of samples is sequenced or periodically throughout the course of a plurality of sets of samples. In some embodiments, the control includes a pool of controls. In some embodiments, a quality control analysis requires that read metrics of variants present in the control sample fall within acceptable criteria. In some embodiments, a quality control requires approval by a pathologist before the results are reported.

In some embodiments, the quality control system includes methods that pass samples for reporting if various criteria are met. Similarly, in some embodiments, the system includes methods that allow for more manual review if a sample does not meet the criteria established for automatic pass. In some embodiments, the criteria for pass of panel sequencing results include one or more of the following:

-   A criterion for the on-target rate of the sequencing reaction, defined as a comparison (e.g., a ratio) of (i) the number of sequenced nucleotides or reads falling within the targeted panel region of a genome and (ii) the number of sequenced nucleotides or reads falling outside of the targeted panel region of the genome. Generally, an on-target rate threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum on-target rate threshold of at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, or greater. In some embodiments, the on-target rate criterion is implemented as a range of acceptable on-target rates, e.g., requiring that the on-target rate for a reaction is from 30% to 70%, from 30% to 80%, from 40% to 70%, from 40% to 80%, and the like.
-   A criterion for the number of total reads generated by the sequencing reaction, including both unique sequence reads and non-unique sequence reads. Generally, a total read number threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum number of total reads threshold of at least 100 million, 110 million, 120 million, 130 million, 140 million, 150 million, 160 million, 170 million, 180 million, 190 million, 200 million, or more total sequence reads. In some embodiments, the criterion is implemented as a range of acceptable number of total reads, e.g., requiring that the sequencing reaction generate from 50 million to 300 million total sequence reads, from 100 million to 300 million sequence reads, from 100 million to 200 million sequence reads, and the like.
-   A criterion for the number of unique reads generated by the sequencing reaction. Generally, a unique read number threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum number of unique reads threshold of at least 3 million, 4 million, 5 million, 6 million, 7 million, 8 million, 9 million, or more unique sequence reads. In some embodiments, the criterion is implemented as a range of acceptable number of unique reads, e.g., requiring that the sequencing reaction generate from 2 million to 10 million unique sequence reads, from 3 million to 9 million unique sequence reads, and the like.
-   A criterion for unique read depth across the panel, defined as a measure of central tendency (e.g., a mean or median) for a distribution of the number of unique reads in the sequencing reaction encompassing the genomic regions targeted by each probe. For instance, in some embodiments, an average unique read depth is calculated for each targeted region defined in a target region BED file, using a first calculation of the number of reads mapped to the region multiplied by the read length, divided by the length of the region, if the length of the region is longer than the read length, or otherwise using a second calculation of the number of reads falling within the region multiplied by the read length. The median of unique read depth across the panel is then calculated as the median of those average unique read depths of all targeted regions. In some embodiments, the resolution as to how depth is calculated is increased or decreased, e.g., in cases where it is necessary or desirable to calculate depth for each base, or for a single gene. Generally, a unique read depth threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum unique read depth threshold of at least 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3250, 3500, or higher unique read depth. In some embodiments, the criterion is implemented as a range of acceptable unique read depth, e.g., requiring that the sequencing reaction generate a unique read depth of from 1000 to 4000, from 1500 to 4000, and the like.
-   A criterion for the unique read depth of a lowest percentile across the panel, defined as a measure of central tendency (e.g., a mean or median) for a distribution of the number of unique reads in the sequencing reaction encompassing the genomic regions targeted by each probe that fall within the lowest percentile of genomic regions by read depth (e.g., the first, second, third, fourth, fifth, tenth, fifteenth, twentieth, twenty-fifth, or similar percentile). Generally, a unique read depth at a lowest percentile threshold will be selected based on the sequencing technology used, the size of the targeted panel, the lowest percentile selected, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum unique read depth threshold at the fifth percentile of at least 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, or higher unique read depth. In some embodiments, the criterion is implemented as a range of acceptable unique read depth at the fifth percentile, e.g., requiring that the sequencing reaction generate a unique read depth at the fifth percentile of from 250 to 3000, from 500 to 3000, from 500 to 2500, and the like.
-   A criterion for the deamination or OxoG Q-score of a sequencing reaction, defined as a Q-score for the occurrence of artifacts arising from template oxidation/deamination. Generally, a deamination or OxoG Q-score threshold will be selected based on the sequencing technology used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum deamination or OxoG Q-score threshold of at least 10, 20, 30, 40, 50, 60, 70, 80, 90, or higher. In some embodiments, the criterion is implemented as a range of acceptable deamination or OxoG Q-scores, e.g., from 10 to 100, from 10 to 90, and the like.
-   A criterion for the estimated contamination fraction of a sequencing reaction, defined as an estimate of the fraction of template fragments in the sample being sequenced arising from contamination of the sample, commonly expressed as a decimal, e.g., where 1% contamination is expressed as 0.01. An example method for estimating contamination in a sequencing method is described in Jun G. et al., Am. J. Hum. Genet., 91:839-48 (2012). For example, in some embodiments, the criterion is implemented as a maximum contamination fraction threshold of no more than 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.0035, or 0.004. In some embodiments, the criterion is implemented as a range of acceptable contamination fractions, e.g., from 0.0005 to 0.005, from 0.0005 to 0.004, from 0.001 to 0.004, and the like.
-   A criterion for the fingerprint correlation score of a sequencing reaction, defined as a Pearson correlation coefficient calculated between the variant allele fractions of a set of pre-defined single nucleotide polymorphisms (SNPs) in two samples. An example method for determining a fingerprint correlation score is described in Sejoon L. et al., Nucleic Acids Research, Volume 45, Issue 11, 20 Jun. 2017, Page e103, the content of which is incorporated herein by reference, in its entirety, for all purposes. For example, in some embodiments, the criterion is implemented as a minimum fingerprint correlation score threshold of at least 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or higher. In some embodiments, the criterion is implemented as a range of acceptable fingerprint correlation scores, e.g., from 0.1 to 0.9, from 0.2 to 0.9, from 0.3 to 0.9, and the like.
-   A criterion for the raw coverage of a minimum percentage of the genomic regions targeted by a probe, defined as a minimum number of unique reads in the sequencing reaction encompassing each of a minimum percentage (e.g., at least 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, 99.9%, and the like) of the genomic regions targeted by the probe panel. In some embodiments, the term “unique read depth” is used to distinguish deduplicated reads from raw reads that may contain multiple reads sequenced from the same original DNA molecule via PCR. Generally, a raw coverage of a minimum percentage of the genomic regions targeted by a probe threshold will be selected based on the sequencing technology used, the size of the targeted panel, the minimum percentage selected, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a raw coverage of 95% of the genomic regions targeted by a probe threshold of at least 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, or higher unique read depth. In some embodiments, the criterion is implemented as a range of acceptable unique read depth for 95% of the genomic regions targeted by a probe, e.g., requiring that the sequencing reaction generate a unique read depth for 95% of the targeted regions of from 250 to 3000, from 500 to 3000, from 500 to 2500, and the like.
-   A criterion for the PCR duplication rate of a sequencing reaction, defined as the percentage of sequence reads that arise from the same template molecule as at least one other sequence read generated by the reaction. Generally, a PCR duplication rate threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a minimum PCR duplication rate threshold of at least 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher. In some embodiments, the criterion is implemented as a range of acceptable PCR duplication rates, e.g., of from 90% to 100%, from 90% to 99%, from 91% to 99%, and the like.
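The per-region depth calculation and panel-level median described in the unique read depth criterion above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the function names and the `(read_count, region_length)` input format are assumptions made for the example, and the short-region branch follows the rule as literally worded in the text.

```python
from statistics import median
from typing import Iterable, Tuple


def region_unique_depth(n_reads: int, read_length: int, region_length: int) -> float:
    """Average unique read depth for one targeted region (hypothetical helper).

    First calculation (region longer than a read):
        reads mapped to the region * read length / region length.
    Second calculation (otherwise), as literally stated in the text:
        reads falling within the region * read length.
    """
    if region_length > read_length:
        return n_reads * read_length / region_length
    return float(n_reads * read_length)


def panel_median_depth(regions: Iterable[Tuple[int, int]], read_length: int) -> float:
    """Median of the per-region average depths across all targeted regions.

    `regions` holds (unique_read_count, region_length) pairs, e.g. as could be
    derived from a target-region BED file plus per-region read counts.
    """
    return median(region_unique_depth(n, read_length, length) for n, length in regions)


# Illustrative values only: three 300 bp regions sequenced with 150 bp reads.
example_regions = [(3000, 300), (4000, 300), (1000, 300)]
example_median = panel_median_depth(example_regions, read_length=150)
```

The resulting median would then be compared against whichever minimum depth threshold or acceptable range is selected for the assay.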

Similarly, in some embodiments, the quality control system includes methods that fail samples for reporting if various criteria are met. In some embodiments, the system includes methods that allow for more manual review if a sample does meet the criteria established for automatic fail. In some embodiments, the criteria for failing panel sequencing results include one or more of the following:

-   A criterion for the on-target rate of the sequencing reaction, defined as a comparison (e.g., a ratio) of (i) the number of sequenced nucleotides or reads falling within the targeted panel region of a genome and (ii) the number of sequenced nucleotides or reads falling outside of the targeted panel region of the genome. Generally, an on-target rate threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum on-target rate threshold of no more than 30%, 40%, 50%, 60%, 70%, or greater. That is, the criterion for failing the sample is satisfied when the on-target rate for the sequencing reaction is below the maximum on-target rate threshold. In some embodiments, the on-target rate criterion is implemented as not falling within a range of acceptable on-target rates, e.g., falling outside of an on-target rate for a reaction of from 30% to 70%, from 30% to 80%, from 40% to 70%, from 40% to 80%, and the like.
-   A criterion for the number of total reads generated by the sequencing reaction, including both unique sequence reads and non-unique sequence reads. Generally, a total read number threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum number of total reads threshold of no more than 100 million, 110 million, 120 million, 130 million, 140 million, 150 million, 160 million, 170 million, 180 million, 190 million, 200 million, or more total sequence reads. That is, the criterion for failing the sample is satisfied when the number of total reads for the sequencing reaction is below the maximum total read threshold. In some embodiments, the criterion is implemented as not falling within a range of acceptable number of total reads, e.g., falling outside of a range of from 50 million to 300 million total sequence reads, from 100 million to 300 million sequence reads, from 100 million to 200 million sequence reads, and the like.
-   A criterion for the number of unique reads generated by the sequencing reaction. Generally, a unique read number threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum number of unique reads threshold of no more than 3 million, 4 million, 5 million, 6 million, 7 million, 8 million, 9 million, or more unique sequence reads. That is, the criterion for failing the sample is satisfied when the number of unique reads for the sequencing reaction is below the maximum unique read threshold. In some embodiments, the criterion is implemented as not falling within a range of acceptable number of unique reads, e.g., falling outside of a range of from 2 million to 10 million unique sequence reads, from 3 million to 9 million unique sequence reads, and the like.
-   A criterion for unique read depth across the panel, defined as a measure of central tendency (e.g., a mean or median) for a distribution of the number of unique reads in the sequencing reaction encompassing the genomic regions targeted by each probe. Generally, a unique read depth threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum unique read depth threshold of no more than 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3250, 3500, or higher unique read depth. That is, the criterion for failing the sample is satisfied when the unique read depth across the panel for the sequencing reaction is below the maximum unique read depth threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable unique read depth, e.g., falling outside of a unique read depth range of from 1000 to 4000, from 1500 to 4000, and the like.
-   A criterion for the unique read depth of a lowest percentile across the panel, defined as a measure of central tendency (e.g., a mean or median) for a distribution of the number of unique reads in the sequencing reaction encompassing the genomic regions targeted by each probe that fall within the lowest percentile of genomic regions by read depth (e.g., the first, second, third, fourth, fifth, tenth, fifteenth, twentieth, twenty-fifth, or similar percentile). Generally, a unique read depth at a lowest percentile threshold will be selected based on the sequencing technology used, the size of the targeted panel, the lowest percentile selected, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum unique read depth threshold at the fifth percentile of no more than 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, or higher unique read depth. That is, the criterion for failing the sample is satisfied when the unique read depth at a lowest percentile for the sequencing reaction is below the maximum unique read depth at a lowest percentile threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable unique read depth at the fifth percentile, e.g., falling outside of a unique read depth at the fifth percentile range of from 250 to 3000, from 500 to 3000, from 500 to 2500, and the like.
-   A criterion for the deamination or OxoG Q-score of a sequencing reaction, defined as a Q-score for the occurrence of artifacts arising from template oxidation/deamination. Generally, a deamination or OxoG Q-score threshold will be selected based on the sequencing technology used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum deamination or OxoG Q-score threshold of no more than 10, 20, 30, 40, 50, 60, 70, 80, 90, or higher. That is, the criterion for failing the sample is satisfied when the deamination or OxoG Q-score for the sequencing reaction is below the maximum deamination or OxoG Q-score threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable deamination or OxoG Q-scores, e.g., falling outside of a deamination or OxoG Q-score range of from 10 to 100, from 10 to 90, and the like.
-   A criterion for the estimated contamination fraction of a sequencing reaction, defined as an estimate of the fraction of template fragments in the sample being sequenced arising from contamination of the sample, commonly expressed as a decimal, e.g., where 1% contamination is expressed as 0.01. An example method for estimating contamination in a sequencing method is described in Jun G. et al., Am. J. Hum. Genet., 91:839-48 (2012). For example, in some embodiments, the criterion is implemented as a minimum contamination fraction threshold of at least 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.0035, or 0.004. That is, the criterion for failing the sample is satisfied when the contamination fraction for the sequencing reaction is above the minimum contamination fraction threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable contamination fractions, e.g., falling outside of a contamination fraction range of from 0.0005 to 0.005, from 0.0005 to 0.004, from 0.001 to 0.004, and the like.
-   A criterion for the fingerprint correlation score of a sequencing reaction, defined as a Pearson correlation coefficient calculated between the variant allele fractions of a set of pre-defined single nucleotide polymorphisms (SNPs) in two samples. An example method for determining a fingerprint correlation score is described in Sejoon L. et al., Nucleic Acids Research, Volume 45, Issue 11, 20 Jun. 2017, Page e103. For example, in some embodiments, the criterion is implemented as a maximum fingerprint correlation score threshold of no more than 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or higher. That is, the criterion for failing the sample is satisfied when the fingerprint correlation score for the sequencing reaction is below the maximum fingerprint correlation score threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable fingerprint correlation scores, e.g., falling outside of a fingerprint correlation range of from 0.1 to 0.9, from 0.2 to 0.9, from 0.3 to 0.9, and the like.
-   A criterion for the raw coverage of a minimum percentage of the genomic regions targeted by a probe, defined as a minimum number of unique reads in the sequencing reaction encompassing each of a minimum percentage (e.g., at least 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, 99.9%, and the like) of the genomic regions targeted by the probe panel. Generally, a raw coverage of a minimum percentage of the genomic regions targeted by a probe threshold will be selected based on the sequencing technology used, the size of the targeted panel, the minimum percentage selected, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a raw coverage of 95% of the genomic regions targeted by a probe threshold of no more than 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, or higher unique read depth. That is, the criterion for failing the sample is satisfied when the raw coverage of a minimum percentage of the genomic regions targeted by a probe for the sequencing reaction is below the maximum raw coverage of a minimum percentage of the genomic regions targeted by a probe threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable unique read depth for 95% of the genomic regions targeted by a probe, e.g., requiring that the sequencing reaction generate a unique read depth for 95% of the targeted regions falling outside of a range of from 250 to 3000, from 500 to 3000, from 500 to 2500, and the like.
-   A criterion for the PCR duplication rate of a sequencing reaction, defined as the percentage of sequence reads that arise from the same template molecule as at least one other sequence read generated by the reaction. Generally, a PCR duplication rate threshold will be selected based on the sequencing technology used, the size of the targeted panel, and the expected number of sequence reads generated by the combination of the technology and targeted panel used. For example, in some embodiments where next generation sequencing-by-synthesis technology is used, the criterion is implemented as a maximum PCR duplication rate threshold of no more than 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher. That is, the criterion for failing the sample is satisfied when the PCR duplication rate for the sequencing reaction is below the maximum PCR duplication rate threshold. In some embodiments, the criterion is implemented as falling outside of a range of acceptable PCR duplication rates, e.g., falling outside of a PCR duplication rate range of from 90% to 100%, from 90% to 99%, from 91% to 99%, and the like.

Thresholds for the auto-pass and auto-fail criteria may be established with reference to one another but are not necessarily set at the same level. For instance, in some embodiments, samples with a metric that falls between auto-pass and auto-fail criteria may be routed for manual review by a qualified bioinformatics scientist. Samples that are failed either automatically or by manual review may be routed to medical and laboratory teams for final review and can be released for downstream processing at the discretion of the laboratory medical director or designee.
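The three-way routing described above (automatic pass, automatic fail, or manual review for metrics that fall in between) can be sketched as follows. This is one possible arrangement under stated assumptions, not the system's actual logic: the `Criterion` class, the metric names, the specific cut-offs (picked from the example values listed above), and the rule that any single failing metric fails the sample are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Criterion:
    """Hypothetical pair of auto-pass / auto-fail cut-offs for one QC metric."""
    auto_pass: float              # at or beyond this value the metric auto-passes
    auto_fail: float              # beyond this value the metric auto-fails
    higher_is_better: bool = True


def route_metric(value: float, c: Criterion) -> str:
    """Return 'pass', 'fail', or 'manual_review' for a single metric."""
    if c.higher_is_better:
        if value >= c.auto_pass:
            return "pass"
        if value < c.auto_fail:
            return "fail"
    else:
        if value <= c.auto_pass:
            return "pass"
        if value > c.auto_fail:
            return "fail"
    return "manual_review"


def route_sample(metrics: Dict[str, float], criteria: Dict[str, Criterion]) -> str:
    """Assumed aggregation: any fail -> fail; else any review -> manual review."""
    outcomes = [route_metric(metrics[name], c) for name, c in criteria.items()]
    if "fail" in outcomes:
        return "fail"
    if "manual_review" in outcomes:
        return "manual_review"
    return "pass"


# Illustrative cut-offs only, chosen from the example values in the lists above.
CRITERIA = {
    "on_target_rate": Criterion(auto_pass=0.40, auto_fail=0.30),
    "median_unique_depth": Criterion(auto_pass=3000, auto_fail=1500),
    "contamination_fraction": Criterion(auto_pass=0.001, auto_fail=0.004,
                                        higher_is_better=False),
}
```

Setting the auto-pass and auto-fail cut-offs at different levels is what creates the intermediate band that is routed to a reviewer.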

Systems and Methods for Improved Validation of Somatic Sequence Variants

An overview of methods for providing clinical support for personalized cancer therapy is described above with reference to FIGS. 2-4. Below, systems and methods for improving validation of somatic sequence variants, e.g., within the context of the methods and systems described above, are described with reference to FIGS. 5A and 5B.

Many of the embodiments described below, in conjunction with FIGS. 5A and 5B, relate to analyses performed using sequencing data for cfDNA obtained from a liquid biopsy sample of a cancer patient. Generally, these embodiments are independent and, thus, not reliant upon any particular DNA sequencing methods. However, in some embodiments, the methods described below include generating the sequencing data.

For example, provided herein is a generalized application of Bayes' Theorem through the likelihood ratio test for diagnostic assays that allows dynamic calibration of filtering thresholds for somatic sequence variant detection in a patient, in accordance with some embodiments of the present disclosure. These thresholds are based on a sample-specific error rate, an error rate from a pool of process-matched healthy control samples, and/or a cohort of human solid tumors, which inform the probability models. The method takes the form of the following formula:

$\mathrm{odds}(\text{post-test}) = \mathrm{odds}(\text{pre-test}) \times \left( \frac{\mathrm{sensitivity}}{1 - \mathrm{specificity}} \right)$

where:

odds(post-test) is the post-test odds of a variant being positive given the application of Bayes' Theorem,

odds(pre-test) is the pre-test odds of a positive given the cancer type of the patient and the prevalence (measured as a fraction) of alterations detected in that gene or within a specific genomic window within a reference population with the cancer type,

sensitivity is the sensitivity bin nearest that measured for the assay at a proposed circulating variant fraction,

specificity is a term to be solved for, denoting the level of uncertainty that is acceptable given some fixed value of odds(post-test). Specificity can be expressed as the quantile of a distribution, such as a beta binomial distribution (see below), defined by the within-sample trinucleotide error rate and the background base-position-specific error rate,

d(beta-binomial) is a beta binomial distribution defined by specified parameters (alpha, beta, Pr), and

Min(AO) is the minimum number of alternate alleles observed for a given sample.
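As a hedged numerical illustration of the likelihood-ratio relation above, the following sketch computes post-test odds from pre-test odds, sensitivity, and specificity. The function names and the input values are assumptions chosen for illustration and are not taken from the disclosure.

```python
def post_test_odds(pre_test_odds: float, sensitivity: float, specificity: float) -> float:
    """odds(post-test) = odds(pre-test) * sensitivity / (1 - specificity)."""
    if not 0.0 <= specificity < 1.0:
        raise ValueError("specificity must be in [0, 1) for a finite likelihood ratio")
    return pre_test_odds * sensitivity / (1.0 - specificity)


def odds_to_probability(odds: float) -> float:
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


# Illustrative inputs: pre-test odds of 0.25 (a 20% prior probability),
# 90% sensitivity, and 99.9% specificity.
example_odds = post_test_odds(pre_test_odds=0.25, sensitivity=0.9, specificity=0.999)
example_probability = odds_to_probability(example_odds)
```

With these inputs the positive likelihood ratio is roughly 900, so a modest prior is raised to post-test odds of roughly 225 to 1.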

Given a fixed value for odds(post-test), it is possible to solve instead for specificity or, rather, the minimum acceptable quantile of the beta binomial error distribution. Therefore, the equation can be reframed as:

$\mathrm{specificity} = 1 - \left( \mathrm{sensitivity} \times \left( \frac{\mathrm{odds}(\text{pre-test})}{\mathrm{odds}(\text{post-test})} \right) \right).$

Solving for specificity gives the quantile of the beta-binomial function, which can then be plugged into quantile(beta-binomial) to derive a minimum number of alternative alleles observed at a given depth, or:

$\text{Min}(\text{AO}) = \text{quantile}\left( 1 - \left( \text{sensitivity} \times \frac{\text{odds}(\text{pre-test})}{\text{odds}(\text{post-test})} \right),\ d(\text{distribution}) \right),$

e.g., where d(distribution) is d(beta-binomial).
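As a minimal sketch of the calculation above, the following Python code solves for specificity and then takes the corresponding quantile of a beta-binomial error distribution at a given depth. The function names, the shape parameters `alpha` and `beta`, and the convention that the quantile is the smallest count whose cumulative probability reaches the required specificity are illustrative assumptions rather than details taken from the present disclosure.

```python
import math


def betabinom_pmf(k, n, alpha, beta):
    """Beta-binomial probability of k error-supporting fragments out of n."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_num = (math.lgamma(k + alpha) + math.lgamma(n - k + beta)
               - math.lgamma(n + alpha + beta))
    log_den = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp(log_choose + log_num - log_den)


def required_specificity(sensitivity, pre_test_odds, post_test_odds):
    """specificity = 1 - sensitivity * odds(pre-test) / odds(post-test)."""
    return 1.0 - sensitivity * (pre_test_odds / post_test_odds)


def min_alt_observations(specificity, depth, alpha, beta):
    """Quantile of the error distribution at the required specificity.

    Assumption: the quantile is the smallest count k whose cumulative
    probability under the error model is at least `specificity`.
    """
    cumulative = 0.0
    for k in range(depth + 1):
        cumulative += betabinom_pmf(k, depth, alpha, beta)
        if cumulative >= specificity:
            return k
    return depth


# Hypothetical inputs chosen only for illustration.
specificity = required_specificity(0.9, pre_test_odds=0.1, post_test_odds=9.0)
min_ao = min_alt_observations(specificity, depth=2000, alpha=1.0, beta=4999.0)
```

Under this sketch, a rarer variant class (lower pre-test odds) yields a higher required specificity and therefore a minimum alternate-allele count that is the same or higher.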

Determining Pre-Test Probability:

In some embodiments, pre-test probability, which is related to odds(pre-test), is defined as:

$\text{odds}(\text{pre-test}) = \frac{\text{probability}(\text{pre-test})}{1 - \text{probability}(\text{pre-test})},$

and is determined through historical data derived from matched solid tumor test data. By analyzing an extensive set of cancers and using process-matched liquid biopsy and tissue biopsy samples to identify somatic variants with high confidence, it is possible to accurately assess the prevalence of specific variants within a population of advanced human cancers. For a population of patients most likely to require liquid biopsy type tests, the sampling distribution most closely models the distribution into which any given patient receiving the test will fall. To model this prevalence, there are two factors at play: gene-level prevalence and genomic-window-level prevalence.
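The conversion between a prevalence expressed as a probability and the corresponding odds can be sketched as follows. The helper names are hypothetical, and the prevalence value in the usage line is illustrative only.

```python
def probability_to_odds(probability):
    """odds = p / (1 - p), defined for 0 <= p < 1."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return probability / (1.0 - probability)


def odds_to_probability(odds):
    """Inverse conversion: p = odds / (1 + odds)."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


# Hypothetical prevalence of 20% within a cohort.
pre_test_odds = probability_to_odds(0.20)
```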

Assessing Prevalence by Sliding Window Segmentation:

In some embodiments, in order to get an accurate estimate of prevalence, it is critical to stratify the estimated rate of mutation by the mechanism of disease. Gain of function (GOF) mutations tend to cluster in “hotspots,” whereas loss of function (LOF) mutations tend to be scattered throughout a gene and suppress or eliminate a protein's wild type behaviors. Due to this evolutionary constraint on mutation position, prevalence calculation must take into account whether a gene has a GOF or LOF mechanism of disease. While this cannot be directly analyzed given available data, it is possible to bootstrap this calculation by segmentation of mutational prevalence across exons.

Based on historical sequencing data, it is possible to bin mutations by exon. In order to assess whether a single exon is enriched for mutations over the rest of the gene (a hotspot or GOF gene), a rolling Poisson test of difference is applied, jumping from exon to exon. If one or more exons show statistically significant deviation from other exons within the gene, that region is annotated as the window of interest. Prevalence is subsequently calculated as prevalence within the exon(s) encompassing the window of interest.

If no exons can be shown to be over-represented for mutations, the gene is assumed to have an LOF mechanism of action and the prevalence for the whole gene having an alteration within the specified cancer type is used. When a variant is being assessed for filtering, the prevalence within the pre-specified window or the prevalence within the gene itself is used as the pre-test probability (Pr(pre-test)) for the likelihood ratio test.
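The window selection described above can be sketched in Python. The text does not give the details of the rolling Poisson test, so this sketch substitutes a one-sided exact conditional test: given the total number of mutations in the gene, the count in one exon is treated as binomial with success probability equal to that exon's share of the total exon length. The function names, the record layout, the Bonferroni correction, and the significance level are all assumptions made for illustration.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def enriched_exons(exon_counts, exon_lengths, alpha=0.01):
    """Indices of exons carrying significantly more mutations than expected."""
    total = sum(exon_counts)
    total_length = sum(exon_lengths)
    hits = []
    for index, (count, length) in enumerate(zip(exon_counts, exon_lengths)):
        p_value = binom_upper_tail(count, total, length / total_length)
        if p_value < alpha / len(exon_counts):  # Bonferroni (assumption)
            hits.append(index)
    return hits


def pre_test_probability(records, cohort_size, exon_lengths, alpha=0.01):
    """records: (patient_id, exon_index) pairs from the reference cohort.

    Returns the inferred mechanism, the window of interest (exon indices,
    empty for a whole-gene window), and the prevalence in that window.
    """
    counts = [0] * len(exon_lengths)
    for _, exon in records:
        counts[exon] += 1
    window = enriched_exons(counts, exon_lengths, alpha)
    if window:
        carriers = {pid for pid, exon in records if exon in window}
        mechanism = "GOF"
    else:
        carriers = {pid for pid, _ in records}
        mechanism = "LOF"
    return mechanism, window, len(carriers) / cohort_size
```

Prevalence is counted over distinct patients, so that a patient with several mutations in the window contributes once.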

Referring to Block 500, the present disclosure provides a method for validating a somatic sequence variant in a test subject having a cancer condition, at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors.

Referring to Block 502, the method comprises obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thus obtaining a first plurality of sequence reads, e.g., a plurality of de-duplicated sequence reads, where each sequence read corresponds to a unique cell-free DNA fragment from the sample. In some embodiments, the first plurality of sequence reads includes at least 1000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 10,000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 100,000 sequence reads. In some embodiments, the first plurality of sequence reads includes at least 200,000, 300,000, 400,000, 500,000, 750,000, 1,000,000, 2,500,000, 5,000,000 sequence reads, or more.

In some embodiments, the liquid biopsy sample is blood. In some embodiments, the liquid biopsy sample comprises blood, whole blood, peripheral blood, plasma, serum, or lymph of the test subject.

In some embodiments, the cancer condition is a particular type and stage of cancer (e.g., stage 2 lung cancer). Advantageously, the variant filtering methods described herein are superior to filtering methods that simply account for the tumor fraction of a sample. This is achieved, in part, by accounting for the types of mutations found in a particular type of cancer, which improves the quality of the pre-test probability of finding a particular type of variant (e.g., a variant within a particular genomic region) in a sample from a subject with a known type of cancer. Accordingly, in some embodiments, the pre-test probabilities are based on as specific a cancer type as possible, e.g., accounting for one or more of a type of cancer, an origin of the cancer, the stage of the cancer, any previously known genomic variants in the cancer (e.g., whether a breast cancer subject is BRCA1 or BRCA2 positive), a personal characteristic of the subject (e.g., age, gender, race, smoking status, alcohol consumption status, etc.), any pathology classification of the cancer, etc. However, there are practical considerations when determining the level of specificity with which a subject's cancer should be specified when matching the cancer to a training cohort. For instance, when an insufficient number of matching training samples is available for calculation of pre-test odds, the specificity of the cancer classification should be reduced in order to provide a large enough sample of training data to provide meaningful prior information.

In some embodiments, the test subject, the liquid biopsy sample, the cancer condition, and/or methods and systems for obtaining, accessioning, storing, processing, preparing and/or analyzing thereof, comprise any of the embodiments as described above in the present disclosure with reference to FIGS. 2-4.

In some embodiments, the first sequencing reaction is a panel-enriched sequencing reaction. For example, in some embodiments, the first sequencing reaction is a panel-enriched sequencing reaction of a first plurality of enriched loci, and each respective locus in the plurality of enriched loci is sequenced at an average unique sequence depth of at least 250×. In some such embodiments, each respective locus in the plurality of enriched loci is sequenced at an average unique sequence depth of at least 1000×. In some embodiments, the first plurality of sequence reads is obtained from ultra-high depth sequencing (e.g., where each locus in a plurality of loci is sequenced at an average coverage of at least 1000×, at least 2500×, or at least 5000×). Example genes that are informative for precision oncology, e.g., when implemented in a liquid biopsy-based assay, are shown in Table 1. In some embodiments, a panel-enriched sequencing reaction described herein uses a probe set that includes at least 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 100, or all 105 of the genes listed in Table 1.

In some embodiments, the first sequencing reaction is a whole genome sequencing reaction, and the average sequencing depth of the reaction across the genome is at least 5×, 10×, 15×, 20×, 25×, 30×, 40×, 50×, or higher.

In some embodiments, the first plurality of sequence reads includes at least 50,000 sequence reads, at least 100,000 sequence reads, at least 250,000 sequence reads, at least 500,000 sequence reads, at least 1,000,000 sequence reads, at least 5,000,000 sequence reads, or more.

In some embodiments, the first sequencing reaction and/or the first plurality of sequence reads includes any of the embodiments as described above in the present disclosure. For example, in some embodiments, methods and systems for nucleic acid extraction, library preparation, capture and hybridization, pooling, sequencing, aligning, normalization and/or other sequence read processing comprise any of the embodiments as described above in the present disclosure with reference to FIGS. 2-4.

Referring to Block 504, the method further comprises aligning each respective sequence read in the first plurality of sequence reads to a reference sequence for the species of the subject, thus identifying (i) a variant allele fragment count for a candidate variant, where the candidate variant maps to a locus in the reference sequence, and (ii) a locus fragment count for the locus encompassing the candidate variant. In some embodiments, the variant allele fragment count refers to the number of unique sequence reads from the test subject that encompass the candidate variant. In some embodiments, the locus fragment count refers to the number of sequence reads from the test subject that map to the respective locus encompassing the candidate variant.
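The two counts defined above can be sketched in Python, assuming a deliberately simplified read representation: each de-duplicated read is an ungapped alignment described by a chromosome, a 0-based start position, and its aligned bases. The `AlignedRead` type and the `fragment_counts` function are hypothetical and handle only single-base substitutions.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    chrom: str
    start: int      # 0-based reference position of the first aligned base
    sequence: str   # ungapped aligned bases (simplifying assumption)


def fragment_counts(reads, chrom, position, alt_base):
    """Return (variant allele fragment count, locus fragment count)."""
    variant_count = 0
    locus_count = 0
    for read in reads:
        offset = position - read.start
        if read.chrom != chrom or not 0 <= offset < len(read.sequence):
            continue  # read does not cover the locus
        locus_count += 1
        if read.sequence[offset] == alt_base:
            variant_count += 1
    return variant_count, locus_count
```

A production pipeline would instead read these counts from an alignment file and handle insertions, deletions, and base quality, none of which this sketch attempts.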

As described above, in some embodiments, the reference sequence is a reference genome, e.g., a reference human genome. In some embodiments, a reference genome has several blacklisted regions, such that the reference genome covers only about 75%, 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, or 99.9% of the entire genome for the species of the subject. In some embodiments, the reference sequence for the subject covers at least 10% of the entire genome for the species of the subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, or more of the entire genome for the species of the subject. In some embodiments, the reference sequence for the subject represents a partial or whole exome for the species of the subject. For instance, in some embodiments, the reference sequence for the subject covers at least 10% of the exome for the species of the subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 98%, 99%, 99.9%, or 100% of the exome for the species of the subject. In some embodiments, the reference sequence covers a plurality of loci that constitute a panel of genomic loci, e.g., a panel of genes used in a panel-enriched sequencing reaction. Examples of genes useful for precision oncology, e.g., which may be targeted with such a panel, are shown in Table 1. Accordingly, in some embodiments, the reference sequence for the subject covers at least 100 kb of the genome for the species of the subject. In other embodiments, the reference sequence for the subject covers at least 250 kb, 500 kb, 750 kb, 1 Mb, 2 Mb, 5 Mb, 10 Mb, 25 Mb, 50 Mb, 100 Mb, 250 Mb, or more of the genome for the species of the subject. However, in some embodiments, there is no size limitation of the reference sequence. For example, in some embodiments, the reference sequence can be a sequence for a single locus (e.g., a single exon, gene, etc.) within the genome for the species of the subject.

Referring to Block 506, the method further comprises comparing the variant allele fragment count for the candidate variant against a dynamic variant count threshold for the locus in the reference sequence that the candidate variant maps to. The dynamic variant count threshold is based upon a pre-test odds of a positive variant call for the locus based upon the prevalence of variants in a genomic region that includes the locus from a first set of nucleic acids obtained from a cohort of subjects having the cancer condition.

For example, in some embodiments, the dynamic variant count threshold is determined based on the number of sequence variants that map to the respective locus, obtained from a sequencing of nucleic acids from a cohort of subjects having the cancer condition (e.g., a baseline variant threshold). In some embodiments, the cohort of subjects having the cancer condition is matched to at least one personal characteristic of the test subject (e.g., age, gender, race, smoking status, average alcohol consumption, other underlying medical conditions, etc.).

In some embodiments, the dynamic variant count threshold is also based upon a sequencing error rate for the sequencing reaction. For example, in some such embodiments, the sequencing error rate for the sequencing reaction is a trinucleotide sequencing error rate. In some embodiments, the dynamic variant count threshold is also based upon a background sequencing error rate determined for the locus.

Referring to Block 508, in some embodiments, the method further comprises obtaining a distribution of variant detection sensitivities as a function of circulating variant allele fraction from the cohort of subjects. The distribution of variant detection sensitivities is based on the circulating variant allele fraction of a second set of nucleic acids collected from the cohort of subjects relative to variant alleles detected in the first set of nucleic acids collected from the cohort of subjects. The first set of nucleic acids are from solid tumor biopsies of the cohort of subjects, and the second set of nucleic acids are cell-free nucleic acids from liquid biopsies of the cohort of subjects.

FIG. 6 illustrates a flow chart of a method 600 for obtaining a distribution of variant detection sensitivities as a function of circulating variant allele fraction from a cohort of subjects, in accordance with some embodiments of the present disclosure. For example, referring to Block 602, matched liquid biopsy and solid tumor samples are obtained from a set of training subjects. In some embodiments, the training subjects comprise any of the cancer conditions, personal characteristics, and/or feature data described above in the present disclosure. Furthermore, in some embodiments, obtaining the matched liquid biopsy and solid tumor samples comprises any of the methods and embodiments described above in the present disclosure.

Referring to Block 604, the solid tumor sample is sequenced (e.g., by extracting nucleic acids from the solid tumor sample and performing a sequencing reaction for the sample). The plurality of sequence reads obtained from sequencing the solid tumor sample are aligned to a reference genome (e.g., a human reference genome), thus determining any sequence variants included in the solid tumor sample. Referring to Block 606, the liquid biopsy sample is sequenced as described above for the solid tumor sample, thus determining any sequence variants included in the liquid biopsy sample.

Referring to Block 608, the results of the sequencing reactions are compared by comparing the sequence variants detected in the liquid biopsy sample against the sequence variants detected in the solid tumor sample (e.g., a measure of how many of the variants detected in the solid tumor sample were also detected in the liquid biopsy sample, or a circulating variant allele fraction). The comparison determines a variant detection sensitivity for each variant (e.g., corresponding to a respective locus) in the liquid biopsy sample. Referring to Block 610, each variant detection sensitivity is binned, in a plurality of bins, with respect to an estimated tumor fraction for the liquid biopsy sample, thus obtaining a distribution of variant detection sensitivities.
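The comparison and binning of Blocks 608 and 610 can be sketched in Python, assuming each tumor-confirmed variant is reduced to a pair of the fraction used for binning and a flag indicating whether the variant was also detected in the matched liquid biopsy. The function name, the half-open bin convention, and the use of `None` for empty bins are illustrative assumptions.

```python
import bisect


def sensitivity_by_bin(observations, bin_edges):
    """Fraction of tumor-confirmed variants detected in plasma, per bin.

    observations: iterable of (fraction, detected) pairs.
    bin_edges: ascending boundaries; bin i covers [edges[i], edges[i + 1]),
    with the final bin also including its upper edge.
    """
    n_bins = len(bin_edges) - 1
    detected = [0] * n_bins
    total = [0] * n_bins
    for fraction, was_detected in observations:
        if not bin_edges[0] <= fraction <= bin_edges[-1]:
            continue  # outside the binned range
        index = min(bisect.bisect_right(bin_edges, fraction) - 1, n_bins - 1)
        total[index] += 1
        detected[index] += int(was_detected)
    # None marks a bin with no training observations.
    return [d / t if t else None for d, t in zip(detected, total)]
```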

In some embodiments, a distribution of variant detection sensitivities is established based on a set of training samples (e.g., sensitivity distribution training data) with known variant allele fractions, e.g., samples derived from a solid tumor sample for which one or more variant allele fractions have been determined (e.g., by deep sequencing of the sample). For example, in some embodiments, nucleic acids from each of a plurality of training samples 181 having a known variant allele fraction 184 for one or more variant alleles 183 are sequenced according to a process-matched sequencing reaction (e.g., using a substantially identical or identical sequencing reaction), and it is determined whether each sequence variant can be detected, e.g., defining a detection status 185 for each locus/variant 183. Over a large number of training samples, a specificity of detection of variants having different variant allele fractions can be determined. In some embodiments, the specificity is determined on a locus-by-locus basis, such that the specificity of detection is specific for the genomic region or locus encompassing the candidate sequence variant. In some embodiments, the specificity is determined globally, e.g., not on a locus-by-locus basis.

Referring again to Block 508, in some embodiments, the method comprises estimating a circulating variant fraction for the candidate variant. In some embodiments, the circulating variant fraction for the candidate variant is the ratio of the variant allele fragment count to the locus fragment count (e.g., the proportion of sequence reads that include the candidate variant in the plurality of sequence reads that map to the respective locus encompassing the variant). In some embodiments, the circulating variant fraction is based only upon the variant allele frequency for that locus. In some alternative embodiments, the circulating variant fraction is a circulating tumor fraction determined for the sample.

For example, in some embodiments, the circulating variant fraction is specific to the variant being validated. In some such embodiments, the estimated variant fraction is determined by calculating the percentage of sequence reads encompassing the locus that include the variant (e.g., a variant allele fraction).

In some embodiments, the estimated circulating variant fraction for the candidate variant is an estimated tumor fraction for the sample, where the estimated tumor fraction for the sample is estimated based on a second sequencing reaction comprising low-pass whole-genome methylation sequencing of a second plurality of cell-free DNA fragments in the liquid biopsy sample of the test subject.

In some such embodiments, the dynamic threshold for the locus is set based upon a desired variant detection specificity determined by the relationship:

$\text{specificity} = 1 - \left( \text{sensitivity} \times \frac{\text{odds}(\text{pre-test})}{\text{odds}(\text{post-test})} \right)$

where sensitivity is the variant detection sensitivity in the distribution of variant detection sensitivities that corresponds to the circulating variant fraction for the candidate variant, odds(post-test) is the post-test odds of a positive variant call for the locus, and odds(pre-test) is the pre-test odds of the positive variant call for the locus.

In some embodiments, the specificity is used to select a quantile of a beta-binomial distribution of the minimal variant allele fragment count required to support a positive variant call for the locus, thus defining the dynamic threshold for the locus. The beta-binomial distribution is defined by the sequencing error rate for the sequencing reaction and the background sequencing error rate determined for the locus. For example, in some embodiments, the minimum number of alternative alleles required to validate a positive variant call is represented by

$\text{Min}(\text{AO}) = \text{quantile}\left( 1 - \left( \text{sensitivity} \times \frac{\text{odds}(\text{pre-test})}{\text{odds}(\text{post-test})} \right),\ d(\text{beta-binomial}) \right)$

In some embodiments, as described in FIG. 6, obtaining the distribution of variant detection sensitivities comprises binning variant detection sensitivities in a plurality of bins as a function of circulating variant allele fraction. Each bin in the plurality of bins is associated with a corresponding variant detection sensitivity, and sensitivity is the variant detection sensitivity corresponding to the respective bin, in the plurality of bins, that encompasses the circulating variant fraction for the candidate variant. In some alternative embodiments, the distribution of variant detection sensitivities is a continuous function.
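Selecting the sensitivity for a candidate variant from a binned distribution can be sketched as a simple lookup. The function name, the bin boundaries, and the per-bin sensitivities in the usage lines are hypothetical values chosen only for illustration.

```python
import bisect


def lookup_sensitivity(fraction, bin_edges, bin_sensitivities):
    """Return the sensitivity of the bin that encompasses `fraction`.

    bin_edges has one more entry than bin_sensitivities; bin i covers
    [edges[i], edges[i + 1]), and the last bin includes its upper edge.
    """
    if len(bin_edges) != len(bin_sensitivities) + 1:
        raise ValueError("expected one more edge than bins")
    if not bin_edges[0] <= fraction <= bin_edges[-1]:
        raise ValueError("fraction lies outside the binned range")
    index = min(bisect.bisect_right(bin_edges, fraction) - 1,
                len(bin_sensitivities) - 1)
    return bin_sensitivities[index]


# Hypothetical bins over circulating variant fraction.
edges = [0.0, 0.001, 0.01, 1.0]
sensitivities = [0.5, 0.9, 0.99]
candidate_sensitivity = lookup_sensitivity(0.004, edges, sensitivities)
```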

Additional details and embodiments for obtaining thresholds for filtering variants (e.g., dynamic thresholds) are described above in the present disclosure (see, Example Methods: Variant Identification).

In some embodiments, the pre-test odds of a positive variant call for the locus is based on (i) the prevalence of variants in the genomic region that includes the locus from the first set of nucleic acids obtained from the cohort of subjects having the cancer condition (e.g., the percentage of patients with the particular cancer type that have a variant in the region of interest), and (ii) a known or inferred effect of the variants. When the known or inferred effect of a variant is loss-of-function (LOF) of a gene that includes the locus, the genomic region used to compute the pre-test probability is the entire gene, and when the known or inferred effect of a variant is gain-of-function (GOF) of the gene that includes the locus, the genomic region used to compute the pre-test probability is the exon, of the gene, that includes the locus.

In some such embodiments, the effect of the variants is inferred by binning each respective variant of the variants in the genomic region that includes the locus from the first set of nucleic acids obtained from the cohort of subjects having the cancer condition into a respective bin, in a plurality of bins for the gene that includes the locus, corresponding to the exon encompassing the respective variant in the gene. Each bin in the plurality of bins corresponds to a different exon of the respective gene. After determining whether any bin in the plurality of bins contains significantly more variants than the other bins in the plurality of bins, the effect of the sequence variant is inferred to be a gain-of-function of the gene when a bin contains significantly more variants than the other bins in the plurality of bins. Alternatively, the effect of the sequence variant is inferred to be a loss-of-function of the gene when no bin in the plurality of bins contains significantly more sequence variants than the other bins in the plurality of bins.

For example, FIG. 7 illustrates a method of inferring an effect of a sequence variant as a gain-of-function or a loss-of-function of a gene, in accordance with some embodiments of the present disclosure.

FIG. 7A illustrates a gene 700-A with a plurality of exons 701-A, 702-A, 703-A. Each exon corresponds to a bin in a plurality of bins. A first exon 701-A comprises a region of interest (e.g., a locus) that encompasses a candidate variant. A plurality of sequence variants (e.g., 704-A, 705-A, 706-A, 707-A, 708-A, 709-A) is obtained from a sequencing of nucleic acids from a cohort of subjects, where each sequence variant maps to a respective locus in the gene. The effect of the variants is inferred by binning each sequence variant into the respective bin corresponding to the exon to which the respective variant maps. Thus, sequence variants 704-A, 705-A, 706-A, and 707-A are binned into the bin corresponding to exon 701-A, sequence variant 708-A is binned into the bin corresponding to exon 702-A, and sequence variant 709-A is binned into the bin corresponding to exon 703-A. In FIG. 7A, it can be determined that the bin corresponding to exon 701-A contains significantly more variants than the other bins in the plurality of bins, and thus the effect of the sequence variant is inferred to be a gain-of-function of the gene. In such case, the genomic region used to compute the pre-test probability is the exon 701-A of the gene that includes the locus encompassing the candidate variant.

Alternatively, FIG. 7B illustrates a gene 700-B with a plurality of exons 701-B, 702-B, 703-B. Each exon corresponds to a bin in a plurality of bins. A first exon 701-B comprises a region of interest (e.g., a locus) that encompasses a candidate variant. A plurality of sequence variants (e.g., 704-B, 705-B, 706-B, 707-B, 708-B, 709-B) is obtained from a sequencing of nucleic acids from a cohort of subjects, where each sequence variant maps to a respective locus in the gene. The effect of the variants is inferred by binning each sequence variant into the respective bin corresponding to the exon to which the respective variant maps. Thus, sequence variants 704-B and 705-B are binned into the bin corresponding to exon 701-B, sequence variants 706-B and 707-B are binned into the bin corresponding to exon 702-B, and sequence variants 708-B and 709-B are binned into the bin corresponding to exon 703-B. In FIG. 7B, it can be determined that no bin in the plurality of bins contains significantly more sequence variants than the other bins in the plurality of bins, and thus the effect of the sequence variant is inferred to be a loss-of-function of the gene. In such case, the genomic region used to compute the pre-test probability is the entire gene.

In some such embodiments, determining whether any bin in the plurality of bins contains significantly more variants than the other bins in the plurality of bins comprises applying a rolling Poisson test of difference between bin counts corresponding to adjacent exons in the gene.

Referring to Block 510, the method further comprises validating the presence of the somatic sequence variant in the test subject when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus, or rejecting the presence of the somatic sequence variant in the test subject when the variant allele fragment count for the candidate variant does not satisfy the dynamic variant count threshold for the locus. In some embodiments, the validating includes other variant filtering criteria, as described above in the present disclosure (see, Example Methods: Variant Identification).

In some embodiments, the methods and systems disclosed herein are used for precision oncology applications. For example, in some embodiments, the method further comprises generating a report for the test subject comprising the identity of variant alleles having variant allele counts, in the first sequencing reaction, that satisfy the dynamic variant count threshold. In some embodiments, the generated report further comprises therapeutic recommendations for the test subject based on the identity of one or more of the reported variant alleles. Additional embodiments for precision oncology applications, including matched clinical trials, matched therapies, report generation, and/or other aspects of the digital and laboratory health care platform are described in detail below.

Another aspect of the present disclosure provides a computer system comprising one or more processors and a non-transitory computer-readable medium including computer-executable instructions that, when executed by the one or more processors, cause the processors to perform a method according to any one of the embodiments disclosed herein.

Another aspect of the present disclosure provides a non-transitory computer-readable storage medium having stored thereon program code instructions that, when executed by a processor, cause the processor to perform the method according to any one of the embodiments disclosed herein.

In some embodiments, the methods described herein include generating a clinical report 139-3 (e.g., a patient report), providing clinical support for personalized cancer therapy, and/or using the information curated from sequencing of a liquid biopsy sample, as described above. In some embodiments, the report is provided to a patient, physician, medical personnel, or researcher in a digital copy (for example, a JSON object, a pdf file, or an image on a website or portal) or a hard copy (for example, printed on paper or another tangible medium). A report object, such as a JSON object, can be used for further processing and/or display. For example, information from the report object can be used to prepare a clinical laboratory report for return to an ordering physician. In some embodiments, the report is presented as text, as audio (for example, recorded or streaming), as images, or in another format and/or any combination thereof.
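A report object of the kind described above can be sketched as a JSON document built with the Python standard library. The field names and the example values are hypothetical placeholders and do not reflect the schema of any actual clinical report.

```python
import json


def build_report_object(patient_id, cancer_type, validated_variants):
    """Serialize validated variants into a JSON report object.

    All keys are illustrative; a real report would follow the schema of
    the reporting system in use.
    """
    report = {
        "patient_id": patient_id,
        "cancer_type": cancer_type,
        "variants": [
            {
                "gene": variant["gene"],
                "variant": variant["variant"],
                "variant_allele_fraction": variant["variant_allele_fraction"],
            }
            for variant in validated_variants
        ],
    }
    return json.dumps(report, indent=2, sort_keys=True)


# Placeholder values for illustration only.
report_json = build_report_object(
    "example-patient",
    "example cancer type",
    [{"gene": "GENE1", "variant": "example variant",
      "variant_allele_fraction": 0.012}],
)
```

The resulting string can be parsed by downstream tools for display or for preparing a laboratory report.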

The report includes information related to the specific characteristics of the patient's cancer, e.g., detected genetic variants, epigenetic abnormalities, associated oncogenic pathogenic infections, and/or pathology abnormalities. In some embodiments, other characteristics of a patient's sample and/or clinical records are also included in the report. For example, in some embodiments, the clinical report includes information on clinical variants, e.g., one or more of copy number variants (e.g., for actionable genes CCNE1, CD274 (PD-L1), EGFR, ERBB2 (HER2), MET, MYC, BRCA1, and/or BRCA2), fusions, translocations, and/or rearrangements (e.g., in actionable genes ALK, ROS1, RET, NTRK1, FGFR2, FGFR3, NTRK2 and/or NTRK3), pathogenic single nucleotide polymorphisms, insertion-deletions (e.g., somatic/tumor and/or germline/normal), therapy biomarkers, microsatellite instability status, and/or tumor mutational burden.

Variant Characterization

In some embodiments, a predicted functional effect and/or clinical interpretation for one or more identified variants is curated by using information from variant databases. In some embodiments, a weighted-heuristic model is used to characterize each variant.

In some embodiments, identified clinical variants are labeled as “potentially actionable”, “biologically relevant”, “variants of unknown significance (VUSs)”, or “benign”. Potentially actionable alterations are protein-altering variants with an associated therapy based on evidence from the medical literature. Biologically relevant alterations are protein-altering variants that may have functional significance or have been observed in the medical literature but are not associated with a specific therapy. Variants of unknown significance (VUSs) are protein-altering variants exhibiting an unclear effect on function and/or without sufficient evidence to determine their pathogenicity. In some embodiments, benign variants are not reported. In some embodiments, variants are identified through aligning the patient's DNA sequence to the human genome reference sequence version hg19 (GRCh37). In some embodiments, actionable and biologically relevant somatic variants are provided in a clinical summary during report generation.

For instance, in some embodiments, variant classification and reporting is performed, where detected variants are investigated following criteria from known evolutionary models, functional data, clinical data, literature, and other research endeavors, including tumor organoid experiments. In some embodiments, variants are prioritized and classified based on known gene-disease relationships, hotspot regions within genes, internal and external somatic databases, primary literature, and other features of somatic drivers. Variants can be added to a patient (or sample, for example, organoid sample) report based on recommendations from the AMP/ASCO/CAP guidelines. Additional guidelines may be followed. Briefly, pathogenic variants with therapeutic, diagnostic, or prognostic significance may be prioritized in the report. Non-actionable pathogenic variants may be included as biologically relevant, followed by variants of uncertain significance. Translocations may be reported based on features of known gene fusions, relevant breakpoints, and biological relevance. Evidence may be curated from public and private databases or research and presented as 1) consensus guidelines, 2) clinical research, or 3) case studies, with a link to the supporting literature. Germline alterations may be reported as secondary findings in a subset of genes for consenting patients. These may include genes recommended by the American College of Medical Genetics and Genomics (ACMG) and additional genes associated with cancer predisposition or drug resistance.

In some embodiments, a clinical report 139-3 includes information about clinical trials for which the patient is eligible, therapies that are specific to the patient's cancer, and/or possible therapeutic adverse effects associated with the specific characteristics of the patient's cancer, e.g., the patient's genetic variations, epigenetic abnormalities, associated oncogenic pathogenic infections, and/or pathology abnormalities, or other characteristics of the patient's sample and/or clinical records. For example, in some embodiments, the clinical report includes such patient information and analysis metrics, including cancer type and/or diagnosis, variant allele fraction, patient demographic and/or institution, matched therapies (e.g., FDA approved and/or investigational), matched clinical trials, variants of unknown significance (VUS), genes with low coverage, panel information, specimen information, details on reported variants, patient clinical history, status and/or availability of previous test results, and/or version of bioinformatics pipeline.

In some embodiments, the results included in the report, and/or any additional results (for example, from the bioinformatics pipeline), are used to query a database of clinical data, for example, to determine whether there is a trend showing that a particular therapy was effective or ineffective in treating the cancer (e.g., in slowing or halting cancer progression), and/or whether such treatments caused adverse effects, in other patients having the same or similar characteristics.

In some embodiments, the results are used to design cell-based studies of the patient's biology, e.g., tumor organoid experiments. For example, an organoid may be genetically engineered to have the same characteristics as the specimen and may be observed after exposure to a therapy to determine whether the therapy can reduce the growth rate of the organoid, and thus may be likely to reduce the growth rate of cancer in the patient associated with the specimen. Similarly, in some embodiments, the results are used to direct studies on tumor organoids derived directly from the patient. An example of such experimentation is described in U.S. Provisional Patent Application No. 62/944,292, filed Dec. 5, 2019, the content of which is hereby incorporated by reference, in its entirety, for all purposes.

As illustrated in FIG. 2A, in some embodiments, a clinical report is checked for final validation, review, and sign-off by a medical practitioner (e.g., a pathologist). The clinical report is then sent for action (e.g., for precision oncology applications).

Digital and Laboratory Health Care Platform:

In some embodiments, the methods and systems described herein are utilized in combination with, or as part of, a digital and laboratory health care platform that is generally targeted to medical care and research. It should be understood that many uses of the methods and systems described above, in combination with such a platform, are possible. One example of such a platform is described in U.S. patent application Ser. No. 16/657,804, filed Oct. 18, 2019, which is hereby incorporated herein by reference in its entirety for all purposes.

For example, an implementation of one or more embodiments of the methods and systems as described above may include microservices constituting a digital and laboratory health care platform supporting analysis of liquid biopsy samples to provide clinical support for personalized cancer therapy. Embodiments may include a single microservice for executing and delivering analysis of liquid biopsy samples to provide clinical support for personalized cancer therapy or may include a plurality of microservices each having a particular role, which together implement one or more of the embodiments above. In one example, a first microservice may execute sequence analysis in order to deliver genomic features to a second microservice for curating clinical support for personalized cancer therapy based on the identified features. Similarly, the second microservice may execute therapeutic analysis of the curated clinical support to deliver recommended therapeutic modalities, according to various embodiments described herein.

Where embodiments above are executed in one or more microservices with or as part of a digital and laboratory health care platform, one or more of such microservices may be part of an order management system that orchestrates the sequence of events as needed at the appropriate time and in the appropriate order necessary to instantiate embodiments above. A microservices-based order management system is disclosed, for example, in U.S. Prov. Patent Application No. 62/873,693, filed Jul. 12, 2019, which is hereby incorporated herein by reference in its entirety for all purposes.

For example, continuing with the above first and second microservices, an order management system may notify the first microservice that an order for curating clinical support for personalized cancer therapy has been received and is ready for processing. The first microservice may execute and notify the order management system once the delivery of genomic features for the patient is ready for the second microservice. Furthermore, the order management system may identify that execution parameters (prerequisites) for the second microservice are satisfied, including that the first microservice has completed, and notify the second microservice that it may continue processing the order to curate clinical support for personalized cancer therapy, according to various embodiments described herein.
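The prerequisite-driven hand-off between the first and second microservices can be illustrated with a minimal sketch. Every name below (Step, OrderManager, sequence_analysis, curate_clinical_support) is hypothetical and chosen only for illustration. The sketch runs in a single process, whereas an actual order management system would notify separate services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Step:
    """One microservice in the order workflow (hypothetical structure)."""
    name: str
    run: Callable[[dict], None]
    prerequisites: Set[str] = field(default_factory=set)


class OrderManager:
    """Runs each step only after all of its prerequisites have completed."""

    def __init__(self, steps: List[Step]):
        self.steps: Dict[str, Step] = {s.name: s for s in steps}

    def process(self, order: dict) -> List[str]:
        completed: List[str] = []
        pending = dict(self.steps)
        while pending:
            ready = [s for s in pending.values()
                     if s.prerequisites.issubset(completed)]
            if not ready:
                raise RuntimeError(f"unsatisfiable prerequisites: {sorted(pending)}")
            for step in ready:
                step.run(order)              # stands in for notifying the service
                completed.append(step.name)
                del pending[step.name]
        return completed


def sequence_analysis(order: dict) -> None:
    # Placeholder for the first microservice: delivers genomic features.
    order["genomic_features"] = ["placeholder_feature"]


def curate_clinical_support(order: dict) -> None:
    # Placeholder for the second microservice: consumes the genomic features.
    order["clinical_support"] = {"based_on": list(order["genomic_features"])}


manager = OrderManager([
    Step("curation", curate_clinical_support, {"sequence_analysis"}),
    Step("sequence_analysis", sequence_analysis),
])
example_order = {"order_id": "example"}
execution_order = manager.process(example_order)
```

Although the curation step is registered first, it is held back until the sequence analysis step has completed, which mirrors the prerequisite check described above.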

Where the digital and laboratory health care platform further includes a genetic analyzer system, the genetic analyzer system may include targeted panels and/or sequencing probes. An example of a targeted panel is disclosed, for example, in U.S. Prov. Patent Application No. 62/902,950, filed Sep. 19, 2019, which is incorporated herein by reference and in its entirety for all purposes. In one example, targeted panels may enable the delivery of next generation sequencing results for providing clinical support for personalized cancer therapy according to various embodiments described herein. An example of the design of next-generation sequencing probes is disclosed, for example, in U.S. Prov. Patent Application No. 62/924,073, filed Oct. 21, 2019, which is incorporated herein by reference and in its entirety for all purposes.

Where the digital and laboratory health care platform further includes a bioinformatics pipeline, the methods and systems described above may be utilized after completion or substantial completion of the systems and methods utilized in the bioinformatics pipeline. As one example, the bioinformatics pipeline may receive next-generation genetic sequencing results and return a set of binary files, such as one or more BAM files, reflecting nucleic acid (e.g., cfDNA, DNA and/or RNA) read counts aligned to a reference genome. The methods and systems described above may be utilized, for example, to ingest the cfDNA, DNA and/or RNA read counts and produce genomic features as a result.

When the digital and laboratory health care platform further includes an RNA data normalizer, any RNA read counts may be normalized before processing embodiments as described above. An example of an RNA data normalizer is disclosed, for example, in U.S. patent application Ser. No. 16/581,706, filed Sep. 24, 2019, which is incorporated herein by reference and in its entirety for all purposes.

When the digital and laboratory health care platform further includes a genetic data deconvoluter, any system and method for deconvoluting may be utilized for analyzing genetic data associated with a specimen having two or more biological components to determine the contribution of each component to the genetic data and/or determine what genetic data would be associated with any component of the specimen if it were purified. An example of a genetic data deconvoluter is disclosed, for example, in U.S. patent application Ser. No. 16/732,229 and PCT/US19/69161, filed Dec. 31, 2019, U.S. Prov. Patent Application No. 62/924,054, filed Oct. 21, 2019, and U.S. Prov. Patent Application No. 62/944,995, filed Dec. 6, 2019, each of which is hereby incorporated herein by reference and in its entirety for all purposes.

When the digital and laboratory health care platform further includes an automated RNA expression caller, RNA expression levels may be adjusted to be expressed as a value relative to a reference expression level. This is often done to prepare multiple RNA expression data sets for analysis and to avoid artifacts that arise when the data sets differ because they were not generated using the same methods, equipment, and/or reagents. An example of an automated RNA expression caller is disclosed, for example, in U.S. Prov. Patent Application No. 62/943,712, filed Dec. 4, 2019, which is incorporated herein by reference and in its entirety for all purposes.

The digital and laboratory health care platform may further include one or more insight engines to deliver information, characteristics, or determinations related to a disease state that may be based on genetic and/or clinical data associated with a patient and/or specimen. Exemplary insight engines may include a tumor of unknown origin engine, a human leukocyte antigen (HLA) loss of homozygosity (LOH) engine, a tumor mutational burden engine, a PD-L1 status engine, a homologous recombination deficiency engine, a cellular pathway activation report engine, an immune infiltration engine, a microsatellite instability engine, a pathogen infection status engine, and so forth. An example tumor of unknown origin engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/855,750, filed May 31, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of an HLA LOH engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/889,510, filed Aug. 20, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of a tumor mutational burden (TMB) engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/804,458, filed Feb. 12, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of a PD-L1 status engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/854,400, filed May 30, 2019, which is incorporated herein by reference and in its entirety for all purposes. An additional example of a PD-L1 status engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/824,039, filed Mar. 26, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of a homologous recombination deficiency engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/804,730, filed Feb. 12, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of a cellular pathway activation report engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/888,163, filed Aug. 16, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of an immune infiltration engine is disclosed, for example, in U.S. patent application Ser. No. 16/533,676, filed Aug. 6, 2019, which is incorporated herein by reference and in its entirety for all purposes. An additional example of an immune infiltration engine is disclosed, for example, in U.S. Patent Application No. 62/804,509, filed Feb. 12, 2019, which is incorporated herein by reference and in its entirety for all purposes. An example of an MSI engine is disclosed, for example, in U.S. patent application Ser. No. 16/653,868, filed Oct. 15, 2019, which is incorporated herein by reference and in its entirety for all purposes. An additional example of an MSI engine is disclosed, for example, in U.S. Prov. Patent Application No. 62/931,600, filed Nov. 6, 2019, which is incorporated herein by reference and in its entirety for all purposes.

When the digital and laboratory health care platform further includes a report generation engine, the methods and systems described above may be utilized to create a summary report of a patient's genetic profile and the results of one or more insight engines for presentation to a physician. For instance, the report may provide to the physician information about the extent to which the specimen that was sequenced contained tumor or normal tissue from a first organ, a second organ, a third organ, and so forth. For example, the report may provide a genetic profile for each of the tissue types, tumors, or organs in the specimen. The genetic profile may represent genetic sequences present in the tissue type, tumor, or organ and may include variants, expression levels, information about gene products, or other information that could be derived from genetic analysis of a tissue, tumor, or organ. The report may include therapies and/or clinical trials matched based on a portion or all of the genetic profile or insight engine findings and summaries. For example, the therapies may be matched according to the systems and methods disclosed in U.S. Prov. Patent Application No. 62/804,724, filed Feb. 12, 2019, which is incorporated herein by reference and in its entirety for all purposes. For example, the clinical trials may be matched according to the systems and methods disclosed in U.S. Prov. Patent Application No. 62/855,913, filed May 31, 2019, which is incorporated herein by reference and in its entirety for all purposes.

The report may include a comparison of the results to a database of results from many specimens. An example of methods and systems for comparing results to a database of results is disclosed in U.S. Prov. Patent Application No. 62/786,739, filed Dec. 31, 2018, which is incorporated herein by reference and in its entirety for all purposes. The information may be used, sometimes in conjunction with similar information from additional specimens and/or clinical response information, to discover biomarkers or design a clinical trial.

When the digital and laboratory health care platform further includes application of one or more of the embodiments herein to organoids developed in connection with the platform, the methods and systems may be used to further evaluate genetic sequencing data derived from an organoid to provide information about the extent to which the organoid that was sequenced contained a first cell type, a second cell type, a third cell type, and so forth. For example, the report may provide a genetic profile for each of the cell types in the specimen. The genetic profile may represent genetic sequences present in a given cell type and may include variants, expression levels, information about gene products, or other information that could be derived from genetic analysis of a cell. The report may include therapies matched based on a portion or all of the deconvoluted information. These therapies may be tested on the organoid, derivatives of that organoid, and/or similar organoids to determine an organoid's sensitivity to those therapies. For example, organoids may be cultured and tested according to the systems and methods disclosed in U.S. patent application Ser. No. 16/693,117, filed Nov. 22, 2019; U.S. Prov. Patent Application No. 62/924,621, filed Oct. 22, 2019; and U.S. Prov. Patent Application No. 62/944,292, filed Dec. 5, 2019, each of which is incorporated herein by reference and in its entirety for all purposes.

When the digital and laboratory health care platform further includes application of one or more of the above in combination with or as part of a medical device or a laboratory developed test that is generally targeted to medical care and research, such laboratory developed test or medical device results may be enhanced and personalized through the use of artificial intelligence. An example of laboratory developed tests, especially those that may be enhanced by artificial intelligence, is disclosed, for example, in U.S. Provisional Patent Application No. 62/924,515, filed Oct. 22, 2019, which is incorporated herein by reference and in its entirety for all purposes.

It should be understood that the examples given above are illustrative and do not limit the uses of the systems and methods described herein in combination with a digital and laboratory health care platform.

The results of the bioinformatics pipeline may be provided for report generation 208. Report generation may comprise variant science analysis, including the interpretation of variants (including somatic and germline variants as applicable) for pathogenic and biological significance. The variant science analysis may also estimate microsatellite instability (MSI) or tumor mutational burden. Targeted treatments may be identified based on gene, variant, and cancer type, for further consideration and review by the ordering physician. In some aspects, clinical trials may be identified for which the patient may be eligible, based on mutations, cancer type, and/or clinical history. Subsequent validation may occur, after which the report may be finalized for sign-out and delivery. In some embodiments, a first or second report may include additional data provided through a clinical dataflow 202, such as patient progress notes, pathology reports, imaging reports, and other relevant documents. Such clinical data is ingested, reviewed, and abstracted based on a predefined set of curation rules. The clinical data is then populated into the patient's clinical history timeline for report generation.

Further details on clinical report generation are disclosed in U.S. patent application Ser. No. 16/789,363 (PCT/US20/180002), filed Feb. 12, 2020, which is hereby incorporated herein by reference in its entirety.

Specific Embodiments of the Disclosure

In some aspects, the systems and methods disclosed herein may be used to support clinical decisions for personalized treatment of cancer. For example, in some embodiments, the methods described herein identify actionable genomic variants and/or genomic states with associated recommended cancer therapies. In some embodiments, the recommended treatment is dependent upon whether or not the subject has a particular actionable variant and/or genomic status. Recommended treatment modalities can be therapeutic drugs and/or assignment to one or more clinical trials. Generally, current treatment guidelines for various cancers are maintained by various organizations, including the National Cancer Institute and Merck & Co., in the Merck Manual.

In some embodiments, the methods described herein further include assigning therapy and/or administering therapy to the subject based on the identification of an actionable genomic variant and/or genomic state, e.g., based on whether or not the subject's cancer will be responsive to a particular personalized cancer therapy regimen. For example, in some embodiments, when the subject's cancer is classified as having a first actionable variant and/or genomic state, the subject is assigned or administered a first personalized cancer therapy that is associated with the first actionable variant and/or genomic state, and when the subject's cancer is classified as having a second actionable variant and/or genomic state, the subject is assigned or administered a second personalized cancer therapy that is associated with the second actionable variant and/or genomic state. Assignment or administration of a therapy or a clinical trial to a subject is thus tailored for treatment of the actionable variants and/or genomic states of the cancer patient.

EXAMPLES

Example 1—The Cancer Genome Atlas (TCGA)

The Cancer Genome Atlas (TCGA) is a publicly available dataset comprising more than two petabytes of genomic data for over 11,000 cancer patients, including clinical information about the cancer patients, metadata about the samples (e.g., the weight of a sample portion, etc.) collected from such patients, histopathology slide images from sample portions, and molecular information derived from the samples (e.g., mRNA/miRNA expression, protein expression, copy number, etc.). The TCGA dataset includes data on 33 different cancers: breast (breast ductal carcinoma, breast lobular carcinoma), central nervous system (glioblastoma multiforme, lower grade glioma), endocrine (adrenocortical carcinoma, papillary thyroid carcinoma, paraganglioma & pheochromocytoma), gastrointestinal (cholangiocarcinoma, colorectal adenocarcinoma, esophageal cancer, liver hepatocellular carcinoma, pancreatic ductal adenocarcinoma, and stomach cancer), gynecologic (cervical cancer, ovarian serous cystadenocarcinoma, uterine carcinosarcoma, and uterine corpus endometrial carcinoma), head and neck (head and neck squamous cell carcinoma, uveal melanoma), hematologic (acute myeloid leukemia, thymoma), skin (cutaneous melanoma), soft tissue (sarcoma), thoracic (lung adenocarcinoma, lung squamous cell carcinoma, and mesothelioma), and urologic (chromophobe renal cell carcinoma, clear cell kidney carcinoma, papillary kidney carcinoma, prostate adenocarcinoma, testicular germ cell cancer, and urothelial bladder carcinoma).

Example 2—Method of Validating a Liquid Biopsy Assay

Conducting Sample Collection, Storage, Nucleic Acid Isolation, and Library Preparation.

To validate a liquid biopsy assay in accordance with some embodiments of the present disclosure, 188 unique specimens were sequenced. These unique specimens included 10 blood specimens purchased from BioIVT, 56 residual plasma samples, 39 whole-blood samples, 4 cfDNA reference standards set in synthetic plasma (Horizon Discovery's Multiplex I cfDNA Reference Standards HD812, HD813, HD814, HD815), and 2 cfDNA reference standard isolates (Horizon Discovery's Structural Multiplex cfDNA reference standard HD786, and 100% Multiplex I Wild Type Reference Standard HD776). Furthermore, an additional 55 blood samples with matched tumor samples were utilized to compare the liquid biopsy and solid tumor tests, and 375 blood samples were sequenced for low-pass whole-genome sequencing (LPWGS) analysis. Sequence data from an additional 1,000 patient samples that were previously sequenced were utilized for retrospective and clinical analyses. All blood was received in Cell-free DNA BCT® blood collection tubes (Streck). Plasma was prepared immediately after accessioning and stored at −80° C. until nucleic acid extraction and library preparation. At that time, cfDNA was isolated from plasma using the Qiagen QIAamp MinElute ccfDNA Midi Kit (QIAGEN) according to the manufacturer's instructions. Automated library preparation was performed on a SciClone NGSx (Perkin Elmer). All cfDNA samples were normalized with molecular grade water to a maximum of 50 microliters (μL).

Conducting the Liquid Biopsy Sequencing Assay.

The liquid biopsy assay utilized New England BioLab's NEBNext® Ultra™ II DNA Library Prep Kit for Illumina®, IDT's xGen CS Adapters, unique molecular indices (UMI), and 96 pairs of barcodes to prepare cfDNA sequencing libraries with unique sample identifiers (IDs). Each sample was ligated to a dual unique index. The dual unique index enables multiplexed sequencing of up to 7 patients and 1 positive control per SP NovaSeq flow cell, 16 patients and 1 positive control per S1 NovaSeq flow cell, 34 patients and 1 positive control per S2 NovaSeq flow cell, and 84 patients and 1 positive control per S4 NovaSeq flow cell. The library preparation protocol is optimized for greater than or equal to 20 nanograms (ng) cfDNA input to maximize mutation detection sensitivity. The final library was sequenced on an Illumina NovaSeq sequencer. Analysis was then performed using a bioinformatics pipeline and analysis server.

The Bioinformatics Pipeline.

Adapter-trimmed FASTQ files were aligned to the nineteenth edition of the human reference genome build (hg19) using Burrows-Wheeler Aligner (BWA). Li et al., 2009, “Fast and accurate short read alignment with Burrows-Wheeler transform,” Bioinformatics, (25), pg. 1754. Following alignment, reads were grouped by alignment position and UMI family, and collapsed into consensus sequences using fgbio tools (available online at fulcrumgenomics.github.io/fgbio/). Bases with insufficient quality or significant disagreement among family members were reverted to N's. Phred scores were scaled based on initial base calling estimates combined across all family members. Following single-strand consensus sequence generation, duplex consensus sequences were generated by comparing the forward and reverse oriented PCR products with mirrored UMI sequences. Consensus sequences were re-aligned to the human reference genome using BWA. BAM files were generated and indexed after the re-alignment.
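The grouping and collapsing step can be illustrated with a simplified sketch. This is not the fgbio algorithm. It ignores base qualities, Phred rescaling, and duplex pairing, and it assumes a hypothetical per-column agreement cutoff (min_agreement) below which a base is reverted to N. The read tuples and function names are likewise illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# (alignment position, UMI, read sequence); a simplified stand-in for a BAM record
Read = Tuple[int, str, str]


def consensus_for_family(seqs: List[str], min_agreement: float = 0.9) -> str:
    """Column-wise majority vote; columns with too much disagreement become 'N'."""
    length = min(len(s) for s in seqs)
    out = []
    for i in range(length):
        base, count = Counter(s[i] for s in seqs).most_common(1)[0]
        agreed = count / len(seqs) >= min_agreement and base != "N"
        out.append(base if agreed else "N")
    return "".join(out)


def collapse_reads(reads: Iterable[Read],
                   min_agreement: float = 0.9) -> Dict[Tuple[int, str], str]:
    """Group reads by (position, UMI) and collapse each family to one sequence."""
    families: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for pos, umi, seq in reads:
        families[(pos, umi)].append(seq)
    return {key: consensus_for_family(seqs, min_agreement)
            for key, seqs in families.items()}


example_reads = [
    (100, "AACG", "ACGT"),
    (100, "AACG", "ACGT"),
    (100, "AACG", "ACTT"),  # disagrees with its family at the third base
    (250, "GGTA", "TTGA"),  # a singleton family
]
consensus = collapse_reads(example_reads)
# consensus[(100, "AACG")] is "ACNT": two of three reads agree at the third
# base, which falls below the assumed 0.9 cutoff, so that base is masked.
```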

SNV and indel variants were detected using VarDict. Lai et al., 2016, “VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research,” Nucleic Acids Res, (44), pg. 108. SNVs were called down to 0.1% VAF for specified hotspot target regions and 0.25% VAF at all other base positions across the panel. Indels were called down to 0.5% VAF for variants within specific regions of interest. Any indels outside of these regions were called down to 5% VAF. All SNVs and indels were then sorted, deduplicated, normalized, and annotated accordingly. Following annotation, variants were classified as germline, somatic, or uncertain using a Bayesian model based on prior expectations informed by various internal and external databases of germline and cancer variants. Uncertain variants were treated as somatic for filtering and reporting purposes. Following classification, variants were filtered based on a plurality of quality metrics including coverage, VAF, strand bias, and genomic complexity. Additionally, variants were filtered with a Bayesian tri-nucleotide context-based model with position level background error rates estimated from a pool of process matched healthy controls. Furthermore, known artifactual variants were removed.
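The four VAF floors stated above can be written as a small lookup. The function and argument names are illustrative assumptions. The sketch covers only the minimum-VAF floor and none of the subsequent classification or quality filtering.

```python
def min_vaf_threshold(variant_type: str, in_priority_region: bool) -> float:
    """Return the minimum VAF (as a fraction) at which a variant is called.

    in_priority_region means a specified hotspot target region for SNVs,
    or a specific region of interest for indels.
    """
    thresholds = {
        ("snv", True): 0.001,     # 0.1% VAF in hotspot target regions
        ("snv", False): 0.0025,   # 0.25% VAF elsewhere on the panel
        ("indel", True): 0.005,   # 0.5% VAF within regions of interest
        ("indel", False): 0.05,   # 5% VAF outside those regions
    }
    return thresholds[(variant_type.lower(), in_priority_region)]


def passes_vaf_floor(alt_count: int, depth: int,
                     variant_type: str, in_priority_region: bool) -> bool:
    """True if the observed VAF meets the floor for this variant class."""
    if depth <= 0:
        return False
    return alt_count / depth >= min_vaf_threshold(variant_type, in_priority_region)
```

For example, 3 supporting fragments at a depth of 2,000 (0.15% VAF) would meet the floor for a hotspot SNV but not for an SNV elsewhere on the panel.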

Copy number variants (CNVs) were analyzed utilizing CNVkit and a CNV annotation and filtering algorithm provided by the present disclosure. Talevich et al., 2016, “CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing,” PLoS Comput Biol, (12), pg. 1004873. CNVkit provides genomic region binning, coverage calculation, bias correction, normalization to a reference pool, segmentation, and visualization. The log₂ ratios between the tumor sample and a pool of process matched healthy samples from the CNVkit output were annotated and filtered using statistical models, such that the amplification status (e.g., amplified or not-amplified) of each gene is predicted and non-focal amplifications are removed.

Rearrangements were detected using the SpeedSeq analysis pipeline. Chiang et al., 2015, “SpeedSeq: ultra-fast personal genome analysis and interpretation,” Nat Methods, (12), pg. 966. Briefly, FASTQ files were aligned to hg19 using BWA. Split reads mapped to multiple positions and read pairs mapped to discordant positions were identified and separated, then utilized to detect gene rearrangements by LUMPY. Layer et al., 2014, “LUMPY: a probabilistic framework for structural variant discovery,” Genome Biol, (15), pg. 84. Fusions were then filtered according to the number of supporting reads.

Predicted functional effect and clinical interpretation for each variant were curated by automated software using information from both internal and external databases. A weighted-heuristic model was used, which has logic-based recommendations from the AMP/ASCO/CAP/ClinGen Somatic working group and ACMG guidelines. Li et al., 2017, “Standards and Guidelines for the Interpretation and Reporting of Sequence Variants in Cancer: A Joint Consensus Recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists,” The Journal of molecular diagnostics, (19), pg. 4; Kalia et al., 2017, “Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics,” Genetics in Medicine, (19), pg. 249.

To detect microsatellite instability, the relative frequency and distribution were determined for any read containing repetitive sequences. To predict the probability of an unstable locus, a k-nearest neighbors model (with k=100) was utilized along with normalized percent lower, mean lower, and mean log-likelihood metrics. The percentage of unstable loci was calculated from the probabilities of each sample, with greater than 50% unstable loci considered microsatellite instability-high (MSI-H).
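The final MSI-H call can be sketched from per-locus probabilities. The text does not state how a per-locus probability is converted into an unstable or stable call, so the 0.5 locus cutoff below is an assumption, as are the function name and the label used for samples that are not MSI-H. The k-nearest neighbors model that produces the probabilities is outside the scope of the sketch.

```python
from typing import Sequence, Tuple


def msi_status(unstable_probabilities: Sequence[float],
               locus_cutoff: float = 0.5,
               msi_high_cutoff: float = 0.5) -> Tuple[float, str]:
    """Return (fraction of unstable loci, status label) for one sample.

    locus_cutoff is an assumed probability above which a locus counts as
    unstable; msi_high_cutoff reflects the stated rule that more than 50%
    unstable loci is considered MSI-H.
    """
    if not unstable_probabilities:
        raise ValueError("at least one locus probability is required")
    unstable = sum(1 for p in unstable_probabilities if p > locus_cutoff)
    fraction = unstable / len(unstable_probabilities)
    status = "MSI-H" if fraction > msi_high_cutoff else "not MSI-H"
    return fraction, status
```

Because the stated rule is "greater than 50%", a sample with exactly half of its loci called unstable is not labeled MSI-H in this sketch.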

The Validation Approach.

Extensive validation studies were conducted to establish robust technical performance of the liquid biopsy assay. Limit of detection (LOD) was determined by assessing analytical sensitivity in reference standards with 5%, 1%, 0.5%, 0.25%, and 0.1% VAF generated from the Horizon Discovery reference set. The Horizon Discovery set includes 160 bp cfDNA fragments from human cell lines in an artificial plasma matrix to closely resemble cfDNA extracted from human plasma. VAFs of SNVs and indels, including EGFR (ΔE746-A750), EGFR (V769-D770insASV), EGFR A767_V769dup, EGFR (L858R), EGFR (T790M), KRAS (G12D), NRAS (A59T), NRAS (Q61K), AKT1 E17K, PIK3CA (E545K), and GNA11 Q209L, and CNVs and rearrangements, including CCDC6/RET, SLC34A2/ROS1, MET, MYC, and MYCN, were measured in reference samples by the liquid biopsy assay of the present disclosure. Each measurement was conducted with a minimum of three replicates at 10 ng, 30 ng, and 50 ng of DNA. Sensitivity was determined by the number of detected variants divided by the total number of variants present in the reference samples. Samples with an on-target rate of less than 30% were excluded from the instant analysis, and MET (4.5 copies) was included in CNV sensitivity determinations. Sensitivity of greater than 90% was considered reliable detection.
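The relationship between per-level sensitivity and the limit of detection can be sketched as follows. The sketch assumes the LOD is taken as the lowest VAF level for which that level and every higher level show sensitivity above 90%. The text states only the 90% criterion, so this reading, the function name, and the counts in the example are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def limit_of_detection(results: Dict[float, Tuple[int, int]],
                       cutoff: float = 0.90) -> Optional[float]:
    """Return the lowest reliably detected VAF level, or None.

    results maps a VAF level (as a fraction) to (variants detected,
    variants present). Levels are scanned from highest to lowest VAF and
    the scan stops at the first level whose sensitivity is not above cutoff.
    """
    lod = None
    for vaf in sorted(results, reverse=True):
        detected, present = results[vaf]
        if present == 0 or detected / present <= cutoff:
            break
        lod = vaf
    return lod


# Hypothetical counts for the five titration levels named in the text.
hypothetical_counts = {
    0.05: (30, 30),
    0.01: (30, 30),
    0.005: (29, 30),
    0.0025: (28, 30),
    0.001: (20, 30),
}
lod = limit_of_detection(hypothetical_counts)
```

With these made-up counts, the 0.1% level falls below the cutoff, so the sketch reports 0.25% as the lowest reliably detected level.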

Analytical specificity was determined using 44 normal samples titrated at 1%, 2.5%, or 5% from a wild-type cfDNA reference standard with a list of confirmed true-negative SNVs, indels, CNVs and rearrangements. Specificity was determined by the number of known true-negative variants divided by the number of true-negative variants plus false-positive variants identified by the liquid biopsy assay.

To assess inter-instrument concordance between the sequencing instruments, 10 patient libraries were sequenced on each instrument (3 NovaSeqs). Variants seen below the lower limit of detection (LLOD) (0.25% for SNVs and 0.50% for indels) were excluded from concordance analysis.

To establish analytical accuracy, the results of 40 validation samples were compared to the results of an orthogonal reference method (Roche's AVENIO ctDNA assay). Analytical accuracy was determined by the number of detected variants divided by the total number of variants present in the sample. Variants that were off-target or below LLOD (0.25% for SNVs and 0.5% for indels) were excluded from the instant analysis.

Conducting Digital Droplet Polymerase Chain Reaction (ddPCR).

Five variants were validated on the ddPCR platform: KRAS G12D (Integrated DNA Technologies, IDT, published sequences); TERT promoter mutations c.-124C>T (C228T) & c.-146C>T (C250T) (Thermo Fisher Scientific); and TP53 p.R273H and TP53 p.R175H (Thermo Fisher Scientific). Each amplification reaction was performed in 25 μL and contained 1× Genotyping Master Mix (Thermo Fisher Scientific), 1× droplet stabilizer (RainDance), 1× of primer/probe mixture for TERT and TP53 (for KRAS: 800 nM of each primer and 500 nM of each probe) plus template. To improve the lower limit of detection, 4-cycle amplification was conducted prior to droplet generation. Amplification for KRAS was conducted using the cycling conditions of: 1 cycle of 95° C. (0.6° C./s ramp) for 10 minutes, 4 cycles of 95° C. (0.6° C./s ramp) for 15 seconds and 60° C. for 2 minutes, followed by 1 cycle of 98° C. (0.6° C./s ramp) for 10 minutes. Cycling conditions for the TP53 variants were the same as those for KRAS with the exception of the annealing and extension temperature, which was set at 55° C. for 2 minutes. Amplification for TERT followed Thermo Fisher's recommendation as follows: 1 cycle of 96° C. (1.6° C./s ramp) for 10 minutes, 4 cycles of 98° C. (1.6° C./s ramp) for 30 seconds and 55° C. for 2 minutes, followed by 1 cycle of 55° C. (1.6° C./s ramp) for 2 minutes. Droplets were then generated on the RainDance Source, and amplification was performed following the above cycling conditions with cycle numbers of 45 for both KRAS and TP53, and 54 for TERT. Droplets were analyzed on a RainDance Sense droplet reader, and RainDrop Analyst II v1.1.0 analysis software was utilized to acquire and analyze data.

The Concordance Between Liquid Biopsy and Solid Tumor Assays.

Matched liquid biopsy and solid tumor sample pairs (n=55) were used to determine analytical sensitivity and specificity. Solid tumor and matched normal samples obtained from peripheral blood buffy coat were analyzed with the solid tumor assay, and corresponding blood plasma samples were analyzed with the liquid biopsy assay of the present disclosure. Only variants in the reportable range of both the solid tumor and liquid biopsy panels were included in these analyses (e.g., genes in the liquid biopsy gene panel are a subset of genes in the solid tumor gene panel). Germline, intronic, and synonymous variants identified in the solid tumor assay and the liquid biopsy assay were excluded from analysis with the exception of intronic splice variants. To determine analytical sensitivity, the number of variants called in both the liquid biopsy assay and the solid tumor assay (e.g., true positives) was divided by the sum of true positives and those called only in the solid tumor assay. To determine analytical specificity, the number of positions reported in neither the liquid biopsy assay nor the solid tumor assay (e.g., true negatives) was divided by the sum of true negatives and variants only called in the liquid biopsy assay.
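The two ratios defined above can be computed with set operations, treating the solid tumor assay as the reference. The variant identifiers, the function name, and the position labels in the example are hypothetical placeholders.

```python
from typing import Optional, Set, Tuple


def concordance(liquid_calls: Set[str],
                tumor_calls: Set[str],
                reportable_positions: Set[str]
                ) -> Tuple[Optional[float], Optional[float]]:
    """Return (sensitivity, specificity) of liquid biopsy calls vs. solid tumor.

    Only positions in the shared reportable range are considered. A ratio
    is None when its denominator is zero.
    """
    liquid = liquid_calls & reportable_positions
    tumor = tumor_calls & reportable_positions

    true_pos = len(liquid & tumor)              # called in both assays
    false_neg = len(tumor - liquid)             # called only in solid tumor
    false_pos = len(liquid - tumor)             # called only in liquid biopsy
    true_neg = len(reportable_positions - (liquid | tumor))  # called in neither

    sens = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else None
    spec = true_neg / (true_neg + false_pos) if (true_neg + false_pos) else None
    return sens, spec


# Hypothetical example with ten reportable positions.
positions = {f"pos{i}" for i in range(1, 11)}
sens, spec = concordance({"pos1", "pos2", "pos3"},
                         {"pos1", "pos2", "pos4"},
                         positions)
```

In this made-up example, two variants are shared, one is seen only in the solid tumor assay, one only in the liquid biopsy assay, and six positions are called in neither.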

To improve variant calling in the liquid biopsy assay, a strategy that dynamically determines local sequence errors using Bayes' theorem and the likelihood ratio test was developed. The dynamic threshold was determined using a sample-specific error rate, the error rate from healthy control samples, and a reference cohort of solid tumor samples. Accordingly, the method of the present disclosure was conducted on 55 matched liquid biopsy/solid tumor tissue samples, with variants detected in the solid tumor assay as the source of truth. The method used sensitivity thresholds defined by the LOD analysis, fixed post-test-odds (e.g., equal to P(post-test)/[1−P(post-test)]), and pre-test-odds. The pre-test-odds were determined using historical data from the solid tumor assay, with an equation identical to the post-test-odds calculation. Accordingly, the following formula was determined based on the above:

specificity = 1 − (pre-test-odds × sensitivity / post-test-odds)
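
The relationship above is the odds form of Bayes' theorem rearranged, since post-test-odds = pre-test-odds × sensitivity / (1 − specificity). A minimal Python sketch of the calculation follows. The function names and all numeric inputs are illustrative assumptions and are not values from the validation.

```python
def odds(probability: float) -> float:
    """Convert a probability P into odds, P / (1 - P)."""
    return probability / (1.0 - probability)


def required_specificity(pre_test_odds: float,
                         sensitivity: float,
                         post_test_odds: float) -> float:
    """specificity = 1 - pre_test_odds * sensitivity / post_test_odds."""
    return 1.0 - pre_test_odds * sensitivity / post_test_odds


# Hypothetical inputs: a locus in a rarely mutated region versus a locus in a
# frequently mutated region, with the same sensitivity and post-test target.
rare_locus = required_specificity(odds(0.01), 0.95, odds(0.99))
common_locus = required_specificity(odds(0.10), 0.95, odds(0.99))
print(f"rare locus:   required specificity = {rare_locus:.6f}")
print(f"common locus: required specificity = {common_locus:.6f}")
```

Under these assumptions, a higher pre-test odds yields a lower required specificity, which is what allows a locus in a frequently mutated region to be called with less supporting evidence.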

The specificity was input to a beta-binomial function and yielded the minimum number of alternate alleles to call a variant at a particular depth. The pre-test-odds metric was specific to individual cancer cohorts and individual genes, allowing for cancer-specific pre-test-odds to be applied to individual exons.
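
One way to turn a required specificity into a minimum alternate-allele count is to model background (error) alternate counts at a given depth with a beta-binomial distribution, then pick the smallest count whose upper-tail probability does not exceed 1 − specificity. The sketch below assumes this reading. The source does not state how the beta-binomial function was parameterized, so the shape parameters `alpha` and `beta`, the function names, and the numeric values are hypothetical.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """P(X = k) for a beta-binomial with n trials and shape (alpha, beta)."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))


def min_alt_count(depth: int, specificity: float,
                  alpha: float, beta: float) -> int:
    """Smallest k with P(X >= k) <= 1 - specificity under the error model.

    Returns depth + 1 if no count at this depth meets the requirement.
    """
    allowed_false_positive = 1.0 - specificity
    tail = 1.0  # P(X >= 0)
    for k in range(depth + 1):
        if tail <= allowed_false_positive:
            return k
        tail -= betabinom_pmf(k, depth, alpha, beta)  # now P(X >= k + 1)
    return depth + 1


# Hypothetical error model with a mean error rate of 0.1% per fragment.
for spec in (0.999, 0.99999):
    k = min_alt_count(depth=1000, specificity=spec, alpha=1.0, beta=999.0)
    print(f"specificity {spec}: call requires at least {k} alternate fragments")
```

A stricter specificity can only raise the minimum count, so loci with low pre-test odds end up needing more variant-supporting fragments at the same depth.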

Conducting Low-Pass Whole Genome Sequencing and Analysis.

Blood samples from 375 patients were sequenced using low-pass whole-genome sequencing (LPWGS) across four flow cells. Sequencing coverage metrics for these samples were determined using Picard CollectWgsMetrics. The tumor fraction and ploidy values for each sample were estimated using ichorCNA with a specific reference panel of 47 normal samples. Adalsteinsson et al., (2017), “Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors” Nat Commun, (8), pg. 1324. Reported variants from the corresponding liquid biopsy analysis of each sample were utilized to assess the accuracy of the tumor fraction estimates.

Determining Estimation of Circulating Tumor Fraction.

Circulating tumor fraction estimate (ctFE) was determined using a novel method, Off-Target Tumor Estimation Routine (OTTER), from off-target reads uniformly distributed across the human reference genome. As described above, CNVkit was run on each sample, and segments were assigned via circular binary segmentation (CBS). Olshen et al., 2004, “Circular binary segmentation for the analysis of array-based DNA copy number data,” Biostatistics, (5), pg. 557. Segments were then fit to integer copy states via an expectation-maximization algorithm using the sum of squared error of the segment log₂ ratios (e.g., normalized to genomic interval size) to expected ratios given a putative copy state and tumor purity. Estimates were confirmed by comparing results against LPWGS of the original patient isolate. As such, results are shown using randomly selected, de-identified samples.
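
The scoring step described above can be sketched as follows. This is not the OTTER implementation: it substitutes a plain grid search over purity for the expectation-maximization algorithm, assumes a diploid normal background, and uses hypothetical function names and segment values. The expected log₂ ratio for a segment with tumor copy state c at purity p is taken as log₂((p·c + (1 − p)·2)/2).

```python
import math

COPY_STATES = tuple(range(0, 9))  # assumed candidate integer copy states


def expected_log2_ratio(copy_state: int, purity: float) -> float:
    """Expected log2 ratio for a tumor copy state mixed with diploid normal DNA."""
    mixed_copies = purity * copy_state + (1.0 - purity) * 2.0
    return math.log2(mixed_copies / 2.0)


def weighted_sse(segments, purity: float) -> float:
    """Size-weighted squared error, assigning each segment its best copy state.

    `segments` is a list of (observed_log2_ratio, size_in_bp) tuples.
    """
    total_size = sum(size for _, size in segments)
    sse = 0.0
    for observed, size in segments:
        best = min((observed - expected_log2_ratio(c, purity)) ** 2
                   for c in COPY_STATES)
        sse += best * size / total_size
    return sse


def fit_purity(segments, grid_steps: int = 100):
    """Grid search over purity in (0, 1); returns (purity, weighted SSE)."""
    candidates = (step / grid_steps for step in range(1, grid_steps))
    return min(((p, weighted_sse(segments, p)) for p in candidates),
               key=lambda pair: pair[1])


# Hypothetical segments generated at purity 0.3: one gain, one loss, one neutral.
segments = [
    (expected_log2_ratio(3, 0.3), 5_000_000),
    (expected_log2_ratio(1, 0.3), 2_000_000),
    (0.0, 10_000_000),
]
print(fit_purity(segments))
```

Several purity and copy-state combinations can explain the same log₂ ratios (for example, a single-copy gain at one purity and a two-copy gain at half that purity), so a practical method needs additional constraints that this sketch does not attempt to model.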

Clinical Profiling of Liquid Biopsy Patients.

De-identified molecular and abstracted clinical data were evaluated in a cohort of 1,000 patients randomly selected from a specific reference clinicogenomic database. All data were de-identified in accordance with the Health Insurance Portability and Accountability Act (HIPAA). Dates used for analyses were relative to the first liquid biopsy sequencing date of each patient, and the year of the first sequencing date was randomly off-set. Variants included in the analyses were those classified as pathogenic or likely pathogenic, and further divided into actionable if matched to diagnostic, prognostic or therapeutic evidence or biologically relevant. Outcomes were determined according to the most recent clinical response noted in patient records. The study protocol was submitted to the Advarra Institutional Review Board (IRB), which determined the research was exempt from IRB oversight and approved a waiver of HIPAA authorization for this study.

Example 3—Results of Validating Liquid Biopsy Assay

Liquid Biopsy Validation Summary.

The liquid biopsy oncology assay is a 105-gene hybrid capture NGS panel designed to detect actionable somatic variant targets in plasma. Referring to FIGS. 16A through 16C, the liquid biopsy assay detects mutations in four variant classes, including: single nucleotide variants (SNVs) and insertion-deletions (indels) in all 105 genes, copy number variants (CNVs) in 6 genes, and chromosomal rearrangements in 7 genes. To validate the liquid biopsy assay, a total of 188 samples were sequenced. The runs generated an average of 261.7 M±40.7 M total reads with 130.7 M±20.3 M read pairs and a unique median read depth of 4999.128±1288.843. The average percent of mapped reads across all runs was 99.876%±0.0078.

Referring to FIG. 13, the analytical sensitivity determined for all SNVs, indels, CNVs, and rearrangements targeted in the reference samples is provided. Accordingly, SNVs were reliably detected at greater than or equal to 0.25% VAF with 30 ng of input DNA (93.75% [45/48] sensitivity), indels at greater than or equal to 0.5% VAF with 30 ng (95.83% [23/24] sensitivity), CNVs at greater than or equal to 0.5% VAF with 10 ng (100.00% [8/8] sensitivity), and rearrangements at greater than or equal to 1% VAF with 30 ng (90% [9/10] sensitivity). Referring to FIG. 14, analytical specificity is provided, which was 100% for SNVs, indels, and rearrangements, and 96.2% for CNVs, on samples with greater than or equal to 0.25% VAF with 30 ng of input DNA.

Accordingly, intra-assay and inter-assay concordance between the replicates in the present disclosure was 100% for SNVs, indicating a high degree of repeatability and reproducibility. Moreover, the inter-instrument concordance was 96.70% for SNVs and 100% for indels, with a combined concordance of 96.83% across instruments. Additionally, interfering substances including genomic DNA, ethanol, and isopropanol did not cause a change in the detection of variants. Concordance between controls and samples with interfering substances was high (e.g., 100%) among samples that passed filtering and were above the LOD.

The Accuracy of the Liquid Biopsy Assay Compared to Orthogonal Assays.

Referring to FIG. 15, to evaluate analytical accuracy, the present disclosure compared the liquid biopsy assay to the Roche AVENIO ctDNA assay. In 30 ng cfDNA samples analyzed by the liquid biopsy assay and the AVENIO cfDNA assay (n=40), sensitivity for SNVs, indels, CNVs and rearrangements was 94.8%, 100%, 100%, and 100%, respectively. Of the 6 SNVs that were not detected, 5 were called but filtered out due to insufficient evidence. In 10 ng samples, sensitivity for SNVs, indels, CNVs, and rearrangements was 91.9%, 100%, 80%, and 100%, respectively. Of the 7 SNVs that were not detected, 6 were present in sequencing data but filtered out due to insufficient evidence.

Referring to FIGS. 8A and 8B, to further validate the liquid biopsy assay results, patients with reported variants KRAS G12D (n=12), TERT c.-124 (n=7), TERT c.-146 (n=5), TP53 R273H (n=7), and TP53 R175H (n=7) were selected for analysis by ddPCR. Liquid biopsy NGS VAF was compared with ddPCR VAF to determine concordance. Accordingly, the comparison showed 100% PPV and a high correlation between ddPCR results and liquid biopsy VAF overall (R²=0.892), as well as for individual variants such as KRAS G12D (R²=0.970), as shown in FIGS. 8A and 8B. These results indicate the liquid biopsy assay of the present disclosure can be used to accurately identify hotspot mutations. Specifically, FIG. 8A illustrates results of an inter-assay comparison between liquid biopsy, ddPCR, and solid tumor results for patient samples with selected variants (n=38) analyzed by ddPCR and compared with liquid biopsy variant allele fraction (VAF), resulting in high correlation overall (R²=0.892). FIG. 8B illustrates results of an inter-assay comparison between liquid biopsy, ddPCR, and solid tumor results for patient samples with individual variants such as KRAS G12D (n=12, R²=0.970).

The Concordance Between Liquid Biopsy and Solid Tumor Tissue Assay.

Analytical sensitivity and specificity were compared in matched solid tumor and liquid biopsy tests from 55 patients. Since solid tumor matched samples include both tumor tissue and buffy coat (e.g., normal comparator), a specific classification strategy was utilized to determine and exclude germline variants from the analysis. Beaubier et al., 2019, “Clinical validation of the xT next-generation targeted oncology sequencing assay,” Oncotarget, 10(24), pg. 2384. Removing intronic and synonymous variants, benign and likely benign variants, as well as variants below the LOD for solid tumor and liquid biopsy assays resulted in 145 concordant SNVs, 20 concordant indels, and 11 concordant CNVs. 66 SNVs, 11 indels, and 8 CNVs were identified that were reported in the solid tumor assay but not the liquid biopsy assay, as well as 209 SNVs, 14 indels, and 7 CNVs that were reported in the liquid biopsy assay but not the solid tumor assay. Accordingly, the specificity of the liquid biopsy assay was 100.00% for SNVs and indels and 96.67% for CNVs. Referring to FIG. 17, a Bayesian dynamic filtering methodology was utilized to further reduce discordance by 11.45%, improving the specificity of variant calling in the liquid biopsy assay. The overall sensitivity of the liquid biopsy assay compared to the solid tumor assay was 68.18% for SNVs and indels and 57.89% for CNVs. When limiting analysis to clinically actionable targets, 107 concordant variants and 37 discordant variants were reported, for a sensitivity of 74.31%.
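
The reported sensitivities follow directly from the variant counts above, using the definition given earlier (concordant calls divided by the sum of concordant calls and calls made only in the solid tumor assay). The short Python check below reproduces them; the function and variable names are illustrative.

```python
def sensitivity(concordant: int, solid_tumor_only: int) -> float:
    """Concordant calls divided by all calls made in the solid tumor assay."""
    return concordant / (concordant + solid_tumor_only)


snv_indel = sensitivity(145 + 20, 66 + 11)  # SNVs and indels combined
cnv = sensitivity(11, 8)
actionable = sensitivity(107, 37)

print(f"SNVs and indels: {snv_indel:.2%}")   # 68.18%
print(f"CNVs:            {cnv:.2%}")         # 57.89%
print(f"Actionable:      {actionable:.2%}")  # 74.31%
```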

Furthermore, the sample classification of reportable variants was compared between matched samples with liquid biopsy and solid tumor testing. Referring to FIG. 8C, variants were considered CH variants if found in the plasma as well as in the solid tumor normal sample but were not present at levels consistent with germline variation. Accordingly, this classification of germline and CH variants in liquid biopsy is possible with a corresponding solid tumor assay or a germline sequencing analysis from the buffy coat. Notably, two samples have a large number of variants only detected in liquid biopsy, many of which are at low VAFs. These samples were subsequently determined to have very high tumor mutational burdens (TMBs) in their corresponding solid tumor analyses. Accordingly, the large number of liquid biopsy variants at low VAFs and high TMBs suggest that these tumors may be more heterogeneous and that some variants are more easily detected in blood. Specifically, FIG. 8C illustrates results of an inter-assay comparison between liquid biopsy, ddPCR, and solid tumor results for sample classification of reportable variants, in which microsatellite instability (MSI) was detected by the liquid biopsy assay in six out of sixteen MSI-high patients, with 100% positive predictive value, as indicated by the one or more blue dots depicted above the dotted line.

Finally, liquid biopsy validation samples were utilized to assess microsatellite instability in patients whose MSI status was previously confirmed by a specific reference clinically validated solid tumor MSI test or immunohistochemistry. Referring to FIG. 8D, the liquid biopsy assay reported MSI-H status in 37.5% (6/16) of orthogonally confirmed MSI-H patients at 100% (6/6) positive predictive value. Accordingly, comparisons between the solid tumor and liquid biopsy assays demonstrate the strengths of the liquid biopsy assay and the added value of using multiple assays to detect genomic drivers of cancer. Specifically, FIG. 8D illustrates results from liquid biopsy and solid tumor assays compared in patients who received both tests (n=55) of FIG. 8A and FIG. 8B, in which the percent circulating tumor DNA VAF, depicted above the dashed line, and number of reportable variants detected, depicted below the dashed line, for each individual patient were categorized by assay type and CHIP or germline status.

OTTER, a Novel Method for Estimating Tumor Fraction.

An accurate measure of tumor fraction can provide an improved understanding of variants identified through liquid biopsy testing. In the present disclosure, a novel method, Off-Target Tumor Estimation Routine (OTTER), for determining a more accurate circulating tumor fraction estimate (ctFE) was developed. Referring to FIGS. 9A and 9B, OTTER ctFE was compared with VAFs from 1,000 random patient samples across cancer types, showing that liquid biopsy ctFE correlates with max pathogenic VAF and median VAF. Referring to FIGS. 9C through 9F, removing germline variants and amplified regions from these analyses further increased the correlation. Plausible liquid biopsy ctFE estimates are expected to be greater than or equal to the maximal somatic VAF in a sample that is not on an amplified region. Overall, after removing germline variants and variants on amplified regions, 90.8% of median VAFs are less than or equal to the corresponding liquid biopsy ctFEs. Referring to FIG. 9H, the distribution of liquid biopsy ctFE for the liquid biopsy 1,000 cohort is provided. Accordingly, the median ctFE was 0.07 with a mean ctFE of 0.12.

In addition to VAF, LPWGS is increasingly utilized to estimate tumor fractions and is thought to be a more accurate measure than VAF. Adalsteinsson et al., 2017; Chen et al., 2019, “Next-generation sequencing in liquid biopsy: cancer screening and early detection,” Hum Genomics, (13), pg. 34. Referring to FIG. 9G, a comparison of the LPWGS ichorCNA-predicted circulating tumor fraction to the OTTER ctFE in matched patient samples (n=375) determined a strong correlation between methods (R²=0.843, P=4.71e−152). Accordingly, this correlation indicates that OTTER ctFEs are highly concordant with estimates using LPWGS but can be determined directly from the targeted-panel sequencing without requiring additional sequencing.

Specifically, FIG. 9A illustrates results from circulating tumor fraction estimate (ctFE) and variant allele fraction (VAF) in which ctFE of liquid biopsy-sequenced patients (n=1,000) was correlated with max pathogenic VAF (R²=0.38). FIG. 9B illustrates results in which ctFE of liquid biopsy-sequenced patients (n=1,000) was correlated with median VAF (R²=0.35). FIG. 9C illustrates results for liquid biopsy-sequenced patients (n=1,000) in which germline variants were removed, increasing the correlation of ctFE with max pathogenic VAF (R²=0.40). FIG. 9D illustrates results for liquid biopsy-sequenced patients (n=1,000) in which germline variants were removed, without increasing the correlation of ctFE with median VAF (R²=0.35). FIG. 9E illustrates results for liquid biopsy-sequenced patients (n=1,000) in which amplified regions were removed from these analyses, increasing the correlation of ctFE with max pathogenic VAF (R²=0.41). FIG. 9F illustrates results for liquid biopsy-sequenced patients (n=1,000) in which amplified regions were removed from these analyses, increasing the correlation of ctFE with median VAF (R²=0.36). FIG. 9G illustrates results for the samples that also underwent low-pass whole genome sequencing (LPWGS, n=375), in which a strong correlation between LPWGS-predicted tumor fraction and ctFE (R²=0.843) is found. Furthermore, FIG. 9H illustrates the overall distribution of ctFE across the cohort of liquid biopsy-sequenced patients (n=1,000) (median ctFE=0.07, mean ctFE=0.12, and standard deviation=0.15).

Retrospective Clinical Profiling of the Liquid Biopsy Assay Against a 1,000-Subject Cohort.

To evaluate the clinical utility of the liquid biopsy, de-identified molecular and clinical data from 1,000 samples across cancer types were selected for clinical profiling. This included 55.7% female and 44.3% male patients, with a median age of 66 years, and interquartile range of 15. Referring to FIG. 18, this cohort included patients from 24 cancer categories, with breast (n=254), colorectal (n=98), lung (n=241), pancreatic (n=83), and prostate (n=96) being the most common. Referring to FIG. 10A, the median ctFE predicted by OTTER was 0.07 for all cancer types, with the exception of prostate, which was 0.06. Referring to FIG. 10B, in this cohort, 8,099 mutations were reported, of which 2,732 were pathogenic, and 2,238 were clinically actionable. Specifically, FIG. 10A illustrates circulating tumor fraction estimate (ctFE) and mutational landscape by cancer type, in which median ctFE among the most common cancer types was 0.07, with the exception of prostate (ctFE=0.06). FIG. 10B illustrates circulating tumor fraction estimate (ctFE) and mutational landscape by cancer type, in which variants are categorized as reportable, pathogenic, or actionable. Across all patients, the most commonly mutated gene was TP53. The heatmap was normalized within rows to depict the most prevalent variants detected for each common cancer type in the cohort (breast n=254, colorectal n=98, lung n=241, pancreatic n=83, and prostate n=96).

Accordingly, the most frequently mutated gene in the liquid biopsy 1,000 cohort was TP53 (51.1% of patients). The most commonly mutated genes were TP53, PIK3CA, ESR1, BRCA2, NF1, ATM and APC in breast cancer, TP53, EGFR, ATM and KRAS in lung cancer, and TP53, APC, and KRAS in colorectal cancer. These findings are consistent with existing literature on commonly mutated genes in each cancer type and suggest the liquid biopsy test accurately detects variants of interest to the broader cancer community. van Helden et al, 2019; Dal Maso et al., 2019; Savli et al., 2019, “TP53, EGFR and PIK3CA gene variations observed as prominent biomarkers in breast and lung cancer by plasma cell-free DNA genomic testing,” J Biotechnol, (300), pg. 87; Cheng et al., 2019, “Liquid Biopsy Detects Relapse Five Months Earlier than Regular Clinical Follow-Up and Guides Targeted Treatment in Breast Cancer,” Case Rep Oncol Med, pg. 6545298; Keup et al., 2019, “Targeted deep sequencing revealed variants in cell-free DNA of hormone receptor-positive metastatic breast cancer patients,” Cell Mol Life Sci, print.; Li et al., 2019, “Genomic profiling of cell-free circulating tumor DNA in patients with colorectal cancer and its fidelity to the genomics of the tumor biopsy,” J Gastrointest Oncol, (10), pg. 831.

Advanced Disease is Associated with Higher Estimated Tumor Fraction.

A goal of liquid biopsy assays of the present disclosure is to more efficiently monitor treatment response and predict disease progression in patients over time. To establish proof of concept, the association of ctFE with advanced disease states was investigated. Accordingly, referring to FIG. 11A, a significant difference in ctFE between stages (P=2.97e−5) was determined. However, since the majority of patients had advanced disease at the time of testing, more early stage samples are necessary to further verify these findings. Referring to FIG. 11B, ctFE in patients with metastatic disease was evaluated, and it was determined that ctFE increases when distant sites are affected. Indeed, referring to FIG. 11C, patients with no metastatic lesions had a significantly lower ctFE than patients with one or more distant sites (P=4.77e−7), further highlighting the potential of ctFE for disease monitoring. Specifically, FIG. 11A illustrates circulating tumor fraction estimate (ctFE) according to stage and number of distant metastases among the liquid biopsy 1,000 cohort, in which there was a significant difference in ctFE between stages (Kruskal-Wallis P=2.97e−5). Accordingly, patients with stage 4 cancer (n=879, median ctFE=0.07) had a higher ctFE than those with stages 1 (n=20, median ctFE=0.06), 2 (n=25, median ctFE=0.06), or 3 (n=76, median ctFE=0.06). FIGS. 11B and 11C illustrate that ctFE increased with the number of metastatic distant sites (Mann-Whitney U test P=7.57e−7), and there was a significant difference in ctFE between patients with no metastatic lesions (n=116) and those with 1 or more distant sites affected (n=884, Mann-Whitney U test P=2.12e−5). The sensitivity and specificity shown on the right-hand side of FIG. 11C represent the probability that a binary metastasis status prediction is correct at a given ctFE threshold. Accordingly, the model predicts metastasis with greater confidence at higher ctFE.
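
To illustrate how sensitivity and specificity can be read off at a given ctFE threshold, the following Python sketch classifies a sample as metastatic when its ctFE meets the threshold and tabulates the result against known status. The decision rule, the function name, and the six data points are hypothetical and are not taken from the cohort or from the model underlying FIG. 11C.

```python
def threshold_metrics(ctfe_values, is_metastatic, threshold):
    """Return (sensitivity, specificity) for the rule 'ctFE >= threshold'."""
    tp = fn = tn = fp = 0
    for ctfe, metastatic in zip(ctfe_values, is_metastatic):
        predicted = ctfe >= threshold
        if metastatic and predicted:
            tp += 1
        elif metastatic:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical samples: ctFE and known metastasis status.
ctfe = [0.02, 0.04, 0.06, 0.10, 0.25, 0.40]
status = [False, False, True, False, True, True]

for cutoff in (0.05, 0.20):
    sens, spec = threshold_metrics(ctfe, status, cutoff)
    print(f"ctFE >= {cutoff:.2f}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy data, raising the threshold trades sensitivity for specificity, which is the kind of trade-off a threshold-indexed sensitivity and specificity curve displays.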

Estimated Tumor Fraction Correlates with Response to Treatment.

To determine how ctFE changes in response to treatment, ctFE was compared with the most recent clinical response outcome. Accordingly, referring to FIG. 12A, patients classified as having complete response were determined to have a significantly lower median ctFE of 0.05, compared to 0.06, 0.06, and 0.08 in patients with stable disease, partial response, and progressive disease, respectively. Additionally, referring to FIG. 12B, patients with multiple liquid biopsy tests were determined to have large differences in ctFE between test dates. For example, referring to FIG. 12C, one breast cancer case had a ctFE of 0.05 at initial liquid biopsy testing. After treatment with bevacizumab and paclitaxel, clinical notes indicate the patient was classified as having stable disease. Eribulin treatment was started shortly after, but the patient was later diagnosed with progressive disease. A second liquid biopsy test, which was performed approximately 200 days after the initial liquid biopsy test, revealed a ctFE of 0.26, which supports the progressive disease diagnosis. Alternatively, in a breast cancer patient with progressive disease who was treated with investigational new drug therapies, the patient's status was updated to stable disease shortly after the first liquid biopsy test, which revealed a ctFE of 0.05. Approximately 100 days later, the patient's second liquid biopsy test revealed a ctFE of 0.09. The patient likely received no further treatment before the third liquid biopsy test, which revealed a ctFE of 0.27, suggesting this patient's disease had progressed. Specifically, FIG. 12A illustrates circulating tumor fraction estimate (ctFE) and abstracted clinical outcomes in a sub-cohort of the liquid biopsy 1,000 cohort (n=388) in which patients with complete response (n=9, ctFE=0.05) exhibited lower ctFE than those with progressive disease (n=298, ctFE=0.08), partial response (n=56, ctFE=0.06), or stable disease (n=25, ctFE=0.06). FIG. 12B illustrates that ctFE was also assessed temporally among a few randomly selected patients with multiple liquid biopsy tests throughout the course of treatment (n=26), with most patients showing large differences in ctFE between test dates. FIG. 12C illustrates four exemplary cases highlighting the utility of ctFE in relation to treatment course and disease status.

In the case of a lung cancer patient who underwent multiple rounds of treatment, including carboplatin, pemetrexed, and etoposide, a decrease in ctFE between liquid biopsy tests (0.72 to 0.47) was determined. However, the ctFE was still extremely high after treatment, making progressive disease likely. Indeed, the patient was classified as having progressive disease by their oncologist shortly before the second liquid biopsy test date. Alternatively, a patient who had undergone treatment with osimertinib and crizotinib approximately 50 days before the first liquid biopsy test showed very little change in ctFE between test dates (0.3-0.11) and was classified as stable shortly before the second liquid biopsy test. Referring to FIGS. 11A through 12C, while conclusions about the larger population cannot be drawn from these individual cases, the changes in ctFE in response to treatment are consistent with the above analyses showing that higher ctFEs are associated with advanced disease. Additionally, these results illustrate how serial testing can be beneficial for precision oncology in individual patients. These results further highlight the need for longitudinal studies with serial liquid biopsy testing in a larger cohort of patients.

While liquid biopsy is a promising tool for improving outcomes in precision oncology, there are challenges that must be overcome before it can replace large panel NGS tissue genotyping. For example, in early stage disease, when treatments have much higher success rates, many patients have low ctDNA fractions that may be below the LOD for liquid biopsies, limiting clinical utility because of the risk of false negatives. Bettegowda et al., 2014, “Detection of circulating tumor DNA in early- and late-stage human malignancies,” Sci Transl Med, (6), pg. 224; Xue et al., 2019, “Early detection and monitoring of cancer in liquid biopsy: advances and challenges,” Expert Rev Mol Diagn, (19), pg. 273; Hennigan et al., 2019, “Low Abundance of Circulating Tumor DNA in Localized Prostate Cancer,” JCO Precis Oncol, (3), print; Abbosh et al., 2018, “Early stage NSCLC - challenges to implementing ctDNA-based screening and MRD detection,” Nat Rev Clin Oncol, (15), pg. 577. Consequently, most studies to date have focused on late stage patients for assay validation and research. Furthermore, while validation studies of existing liquid biopsy assays have shown high sensitivity and specificity, few studies have corroborated results with orthogonal methods, or between NGS testing platforms. Cheng et al., 2019, “Clinical Validation of a Cell-Free DNA Gene Panel,” J Mol Diagn, (21), pg. 632; Hanibuchi et al., 2019, “Development, validation, and comparison of gene analysis methods for detecting EGFR mutation from non-small cell lung cancer patients-derived circulating free DNA,” Oncotarget, (10), pg. 3654; Van Laar et al., 2018, “Development and validation of a plasma-based melanoma biomarker suitable for clinical use,” Br J Cancer, (118), pg. 857; Odegaard et al., 2018, “Validation of a Plasma-Based Comprehensive Cancer Genotyping Assay Utilizing Orthogonal Tissue- and Plasma-Based Methodologies,” Clin Cancer Res, (24), pg. 3539; Clark et al., 2018, “Analytical Validation of a Hybrid Capture-Based Next-Generation Sequencing Clinical Assay for Genomic Profiling of Cell-Free Circulating Tumor DNA,” J Mol Diagn, (20), pg. 686; Plagnol et al., 2018, “Analytical validation of a next generation sequencing liquid biopsy assay for high sensitivity broad molecular profiling,” PLoS One, (13), pg. 0193802. Kuderer et al. compared commercially available liquid and tissue NGS platforms and found only 22% concordance in genetic alterations. Kuderer et al., 2017, “Comparison of 2 Commercially Available Next-Generation Sequencing Platforms in Oncology,” JAMA Oncol, (3), pg. 996. Other reports of liquid biopsy based studies are limited by comparison to non-comprehensive tissue testing algorithms including Sanger sequencing, small NGS hotspot panels, PCR and FISH, which may not contain all NCCN guideline genes in their reportable range, thus suffering in comparison to a more comprehensive liquid biopsy assay. Leighl et al., 2019. Since the 105 gene liquid biopsy assay is a subset of the 648 gene solid tumor tissue-based assay, the concordance data presented herein (74.31% for actionable variants) represents a direct comparison to a comprehensive NGS test which includes the entire reportable range of the liquid biopsy assay. Beaubier et al., 2019, “Integrated genomic profiling expands clinical options for patients with cancer,” Nat Biotechnol, (37), pg. 1351. While this concordance is high relative to previous reports, 25.69% of actionable variants would have been missed if only one of the tests were performed. Thus, liquid biopsies provide the greatest value to patients when used in combination with standard tissue genotyping. Furthermore, having both tests enabled additional analyses to exclude germline and CH variants, significantly improving specificity.

Accordingly, the systems and methods of the present disclosure provide analytical and clinical validation of the liquid biopsy assay. The systems and methods of the present disclosure provide high accuracy compared to orthogonal methods, including tissue biopsy, Avenio liquid biopsy, ddPCR, and LPWGS. The systems and methods of the present disclosure also provide improvements upon existing methodologies for estimating circulating tumor fraction. Notably, in combination with real-world clinical data, the systems and methods of the present disclosure demonstrate the value and suitability of liquid biopsy testing for monitoring disease progression, predicting objective measures of response, and assessing treatment outcomes. As such, the results obtained through validating the systems and methods of the present disclosure strongly support utilizing the liquid biopsy assay in routine monitoring of cancer patients with advanced disease.

REFERENCES CITED AND ALTERNATIVE EMBODIMENTS

All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.

The present invention can be implemented as a computer program product that comprises a computer program mechanism embedded in a non-transitory computer readable storage medium. These program modules can be stored on a CD-ROM, DVD, magnetic disk storage product, USB key, or any other non-transitory computer readable data or program storage product.

Many modifications and variations of this invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific embodiments described herein are offered by way of example only. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The invention is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.

What is claimed is:
 1. A method of validating a somatic sequence variantin a cancerous tissue of a test subject having a cancer condition, themethod comprising: at a computer system having one or more processors,and memory storing one or more programs for execution by the one or moreprocessors: (A) obtaining, from a first sequencing reaction, acorresponding sequence of each cell-free DNA fragment in a firstplurality of cell-free DNA fragments in a liquid biopsy sample of thetest subject, thereby obtaining a first plurality of sequence reads,wherein the first plurality of sequence reads comprises at least 100,000sequence reads; (B) aligning each respective sequence read in the firstplurality of sequence reads to a reference sequence for the species ofthe subject thereby identifying a candidate somatic sequence variantmapping to a respective locus in the reference sequence; (C) determiningfor the candidate somatic sequence variant, (i) a respective variantallele fragment count for the first sequencing reaction, and (ii) arespective locus fragment count for the first sequencing reaction; and(D) comparing the respective variant allele fragment count for thecandidate somatic sequence variant against a dynamic variant countthreshold for the respective locus in the reference sequence that thecandidate variant maps to, wherein the dynamic variant count thresholdis based upon at least a pre-test odds of a positive variant call forthe respective locus based upon a prevalence of variants in a genomicregion that includes the respective locus in a cohort of trainingsubjects having the cancer condition, and: when the variant allelefragment count for the candidate somatic sequence variant satisfies thedynamic variant count threshold for the respective locus, not rejectingthe presence of the candidate somatic sequence variant in the testsubject, or when the variant allele fragment count for the candidatesomatic sequence variant does not satisfy the dynamic variant countthreshold for the locus, 
rejecting the presence of the candidate somaticsequence variant in the test subject.
 2. A method of validating a somatic sequence variant in a test subject having a cancer condition, the method comprising: at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors: (A) obtaining, from a first sequencing reaction, a corresponding sequence of each cell-free DNA fragment in a first plurality of cell-free DNA fragments in a liquid biopsy sample of the test subject, thereby obtaining a first plurality of sequence reads; (B) aligning each respective sequence read in the first plurality of sequence reads to a reference sequence for the species of the subject thereby identifying a candidate somatic sequence variant mapping to a respective locus in the reference sequence, wherein the reference sequence for the species represents at least 1 Mb of the genome for the species; (C) determining for the candidate somatic sequence variant, (i) a respective variant allele fragment count for the first sequencing reaction, and (ii) a respective locus fragment count for the first sequencing reaction; and (D) comparing the respective variant allele fragment count for the candidate somatic sequence variant against a dynamic variant count threshold for the respective locus in the reference sequence that the candidate variant maps to, wherein the dynamic variant count threshold is based upon at least a pre-test odds of a positive variant call for the respective locus based upon a prevalence of variants in a genomic region that includes the respective locus in a cohort of training subjects having the cancer condition, and: when the variant allele fragment count for the candidate somatic sequence variant satisfies the dynamic variant count threshold for the respective locus, not rejecting the presence of the candidate somatic sequence variant in the test subject, or when the variant allele fragment count for the candidate somatic sequence variant does not satisfy the dynamic variant count threshold for the locus, rejecting the presence of the candidate somatic sequence variant in the test subject.
 3. The method of claim 1 or 2, wherein the dynamic variant count threshold is also based upon a sequencing error rate for the sequencing reaction.
 4. The method of claim 3, wherein the sequencing error rate for the sequencing reaction is a trinucleotide sequencing error rate.
 5. The method of any one of claims 1-4, wherein the dynamic variant count threshold is also based upon a background sequencing error rate determined for the locus.
 6. The method of any one of claims 1-5, the method further comprising: determining the dynamic variant count threshold based upon a variant detection specificity determined according to the relationship: $\text{specificity} = 1 - \left( \text{sensitivity} \times \frac{\text{odds}(\text{pre-test})}{\text{odds}(\text{post-test})} \right)$ wherein, sensitivity is a variant detection sensitivity selected from a distribution of variant detection sensitivities based on an estimated circulating variant fraction for the candidate variant, odds(post-test) is a post-test odds of a positive variant call for the locus, and odds(pre-test) is the pre-test odds of the positive variant call for the locus.
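The relationship recited in claim 6 can be sketched numerically as follows. This is a minimal illustration, not the claimed implementation: the function name `required_specificity` and the example inputs are assumptions introduced here.

```python
def required_specificity(sensitivity: float,
                         pre_test_odds: float,
                         post_test_odds: float) -> float:
    """Return 1 - sensitivity * odds(pre-test) / odds(post-test).

    All three inputs are taken as given; how they are estimated is
    outside the scope of this sketch.
    """
    if post_test_odds <= 0.0:
        raise ValueError("post-test odds must be positive")
    return 1.0 - sensitivity * (pre_test_odds / post_test_odds)


# Hypothetical inputs, chosen only for illustration.
example = required_specificity(sensitivity=0.9,
                               pre_test_odds=0.1,
                               post_test_odds=9.0)
print(round(example, 4))  # 0.99
```

The relationship is consistent with the standard identity that post-test odds equal pre-test odds multiplied by the positive likelihood ratio, sensitivity / (1 - specificity), solved for specificity.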
 7. The method of claim 6, wherein the distribution of variant detection sensitivities is based on a correlation between (i) the detection rate of a reference variant allele, in one or more sequencing reactions that are process-matched with the first sequencing reaction, for a plurality of cancer samples, and (ii) the variant allele fractions for the reference variant allele in the cancer samples.
 8. The method of claim 7, wherein the correlation is established by determining, for each respective bin in a plurality of bins collectively representing a span of variant allele fractions represented in the cancer samples, wherein each respective bin corresponds to a contiguous span of variant allele fractions that does not overlap with any other respective bin in the plurality of bins, a corresponding sensitivity for detection of the reference variant alleles for the corresponding subset of cancer samples.
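The binning recited in claim 8 can be sketched as a per-bin detection rate. This is an illustrative sketch under stated assumptions: the function name `binned_sensitivity`, the half-open bin convention, and the input layout of (variant allele fraction, detected) pairs are all hypothetical and not taken from the source.

```python
from bisect import bisect_right


def binned_sensitivity(observations, bin_edges):
    """Estimate detection sensitivity per variant-allele-fraction bin.

    observations: iterable of (vaf, detected) pairs, one per cancer sample.
    bin_edges: ascending edges; bin i covers [bin_edges[i], bin_edges[i + 1]),
               so bins are contiguous and non-overlapping.
    Returns one sensitivity per bin, or None for a bin with no samples.
    """
    n_bins = len(bin_edges) - 1
    detected = [0] * n_bins
    total = [0] * n_bins
    for vaf, was_detected in observations:
        i = bisect_right(bin_edges, vaf) - 1
        if 0 <= i < n_bins:  # samples outside the covered span are skipped
            total[i] += 1
            if was_detected:
                detected[i] += 1
    return [d / t if t else None for d, t in zip(detected, total)]


# Hypothetical data, for illustration only.
samples = [(0.005, False), (0.005, True), (0.02, True), (0.02, True), (0.5, True)]
print(binned_sensitivity(samples, [0.0, 0.01, 0.05, 1.0]))  # [0.5, 1.0, 1.0]
```

A sensitivity for a candidate variant could then be looked up from the bin containing its estimated circulating variant fraction; that lookup is left out of the sketch.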
 9. The method of any one of claims 6-8, wherein the estimated circulating variant fraction for the candidate variant is a variant allele fraction determined from a comparison of (i) the respective variant allele fragment count for the first sequencing reaction, to (ii) the respective locus fragment count for the first sequencing reaction.
 10. The method of any one of claims 6-9, wherein the specificity is used to select a quantile of a beta-binomial distribution of the minimal variant allele fragment count required to support a positive variant call for the locus, thereby defining the dynamic threshold for the locus, wherein the beta-binomial distribution is defined by a sequencing error rate for the sequencing reaction and a background sequencing error rate determined for the locus.
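One possible reading of the quantile selection in claim 10 is sketched below. The claim does not state how the two error rates map onto the distribution's shape parameters, so `alpha` and `beta` are taken directly as hypothetical inputs, and the function names are introduced here for illustration only. The sketch returns the smallest fragment count that noise alone would reach with probability at most 1 - specificity.

```python
from math import exp, lgamma


def _log_beta(x: float, y: float) -> float:
    """Natural log of the Beta function."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def beta_binomial_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """P(X = k) for X ~ BetaBinomial(n, alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose
               + _log_beta(k + alpha, n - k + beta)
               - _log_beta(alpha, beta))


def minimal_variant_count(n: int, alpha: float, beta: float,
                          specificity: float) -> int:
    """Smallest count t with P(X < t) >= specificity under the noise model.

    n is the locus fragment count; alpha and beta are assumed shape
    parameters of the error model (their derivation is not shown).
    """
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += beta_binomial_pmf(k, n, alpha, beta)
        if cumulative >= specificity:
            return k + 1
    return n + 1  # no count within the observed depth satisfies the target


# Hypothetical example: 1000 fragments, mean error rate alpha / (alpha + beta) = 0.001.
threshold = minimal_variant_count(1000, 1.0, 999.0, 0.999)
```

Under this reading, a higher required specificity can only raise the threshold or leave it unchanged, and a candidate variant would satisfy the threshold when its variant allele fragment count is at least the returned value.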
 11. The method of any one of claims 6-10, wherein the pre-test odds of a positive variant call for the locus is based on the prevalence of variants in a genomic region that includes the locus from the first set of nucleic acids obtained from the cohort of subjects having the cancer condition.
 12. The method of claim 11, wherein, when the genomic region that includes the locus is associated with a mutation known to confer resistance against a therapy used to treat the cancer condition, the pre-test odds are boosted based on a pre-test-odds multiplier specific for the genomic region.
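The pre-test odds of claims 11 and 12 can be sketched from cohort counts as follows. This is an assumption-laden illustration: the function name `pre_test_odds`, its arguments, and the example counts and multiplier are hypothetical, and the source does not specify how prevalence is tallied.

```python
def pre_test_odds(subjects_with_variant: int,
                  cohort_size: int,
                  multiplier: float = 1.0) -> float:
    """Odds of a positive call from cohort prevalence, optionally boosted.

    Prevalence p = subjects_with_variant / cohort_size, and odds = p / (1 - p),
    which simplifies to carriers / non-carriers. The multiplier stands in for
    a region-specific boost (for example, for a resistance-associated region).
    """
    non_carriers = cohort_size - subjects_with_variant
    if subjects_with_variant < 0 or non_carriers <= 0:
        raise ValueError("need 0 <= subjects_with_variant < cohort_size")
    if multiplier <= 0.0:
        raise ValueError("multiplier must be positive")
    return multiplier * subjects_with_variant / non_carriers


# Hypothetical cohort: 20 of 100 subjects carry a variant in the region.
baseline = pre_test_odds(20, 100)        # 0.25
boosted = pre_test_odds(20, 100, 4.0)    # 1.0
```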
 13. The method of claim 11 or 12, wherein the pre-test odds of a positive variant call for the locus is further based on a known or inferred effect of the variants, wherein: when the known or inferred effect of a variant is loss-of-function of a gene that includes the locus, the genomic region used to compute the pre-test probability is the entire gene, and when the known or inferred effect of a variant is gain-of-function of the gene that includes the locus, the genomic region used to compute the pre-test probability is the exon, of the gene, that includes the locus.
 14. The method of claim 13, wherein the effect of the variants is inferred by: binning each respective variant of the variants in the genomic region that includes the locus from the first set of nucleic acids obtained from the cohort of subjects having the cancer condition into a respective bin, in a plurality of bins for the gene that includes the locus, corresponding to the exon encompassing the respective variant in the gene, wherein each bin in the plurality of bins corresponds to a different exon of the respective gene; and determining whether any bin in the plurality of bins contains significantly more variants than the other bins in the plurality of bins, wherein: when a bin contains significantly more variants than the other bins in the plurality of bins, the effect of the sequence variant is inferred to be a gain-of-function of the gene, and when no bin in the plurality of bins contains significantly more sequence variants than the other bins in the plurality of bins, the effect of the sequence variant is inferred to be a loss-of-function of the gene.
 15. The method of claim 14, wherein determining whether any bin in the plurality of bins contains significantly more variants than the other bins in the plurality of bins comprises applying a rolling Poisson test of difference between bin counts corresponding to adjacent exons in the gene.
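The exon-level comparison of claims 14 and 15 can be sketched as below. The source does not specify the exact form of the test, so this sketch substitutes the conditional (binomial) form of a two-sample Poisson comparison, which assumes equal exposure for adjacent exons (for example, comparable exon lengths). The function names, the 0.05 significance level, and the absence of any multiple-testing correction are all assumptions made for illustration.

```python
from math import comb


def poisson_difference_pvalue(count_a: int, count_b: int) -> float:
    """Two-sided p-value that two Poisson counts share one rate.

    Conditional on the total, count_a ~ Binomial(total, 0.5) under the null
    (assuming equal exposure), so a symmetric exact binomial test applies.
    """
    total = count_a + count_b
    if total == 0:
        return 1.0
    smaller = min(count_a, count_b)
    tail = sum(comb(total, k) for k in range(smaller + 1)) / 2 ** total
    return min(1.0, 2.0 * tail)


def rolling_poisson_test(exon_counts):
    """P-values for each pair of adjacent exon bins, in gene order."""
    return [poisson_difference_pvalue(a, b)
            for a, b in zip(exon_counts, exon_counts[1:])]


def infer_effect(exon_counts, alpha: float = 0.05) -> str:
    """Simplified rule: any significant adjacent difference suggests a hotspot."""
    if any(p < alpha for p in rolling_poisson_test(exon_counts)):
        return "gain-of-function"
    return "loss-of-function"


# Hypothetical per-exon variant counts for one gene.
print(infer_effect([5, 5, 40, 4]))  # gain-of-function
print(infer_effect([6, 5, 7, 6]))   # loss-of-function
```

The inferred effect would then determine whether the whole gene or only the exon containing the locus is used as the genomic region when computing the pre-test odds.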
 16. The method of any one of claims 1-15, wherein the liquid biopsy sample is blood.
 17. The method of any one of claims 1-15, wherein the liquid biopsy sample comprises blood, whole blood, peripheral blood, plasma, serum, or lymph of the test subject.
 18. The method of any one of claims 1-17, wherein: the first sequencing reaction is a panel-enriched sequencing reaction of a first plurality of enriched loci, and each respective locus in the plurality of enriched loci is sequenced at an average unique sequence depth of at least 250×.
 19. The method of claim 18, wherein each respective locus in the plurality of enriched loci is sequenced at an average unique sequence depth of at least 1000×.
 20. The method of claim 18 or 19, wherein the panel-enriched sequencing reaction uses a sequencing panel that enriches for at least 50 genes.
 21. The method of any one of claims 18-20, wherein the panel-enriched sequencing reaction uses a sequencing panel that enriches for at least 10 genes listed in Table 1.
 22. The method of any one of claims 18-21, wherein the panel-enriched sequencing reaction uses a sequencing panel that enriches for at least 10 genes listed in List 1.
 23. The method of any one of claims 18-22, wherein the panel-enriched sequencing reaction uses a sequencing panel that enriches for at least 10 genes listed in List 2.
 24. The method of any one of claims 1-23, wherein: the first sequencing reaction is a whole genome sequencing reaction, and the average sequencing depth of the reaction across the genome is at least 25×.
 25. The method of any one of claims 1-24, wherein the first plurality of sequence reads comprises at least 50,000 sequence reads.
 26. The method of any one of claims 1-24, wherein the first plurality of sequence reads comprises at least 250,000 sequence reads.
 27. The method of any one of claims 1-26, wherein the cancer condition is a particular type and stage of cancer.
 28. The method of any one of claims 1-27, wherein the cohort of subjects having the cancer condition is matched to at least one personal characteristic of the subject.
 29. The method of any one of claims 1-27, wherein, when the variant allele fragment count for the candidate variant satisfies the dynamic variant count threshold for the locus and all other variant calling
 30. The method of any one of claims 1-28, further comprising generating a report for the test subject comprising the identity of variant alleles having variant allele counts, in the first sequencing reaction, that satisfy the dynamic variant count threshold.
 31. The method of claim 30, wherein the generated report further comprises therapeutic recommendations for the test subject based on the identity of one or more of the reported variant alleles.
 32. A computer system comprising: one or more processors; and a non-transitory computer-readable medium including computer-executable instructions that, when executed by the one or more processors, cause the processors to perform a method according to any one of claims 1-31.
 33. A non-transitory computer-readable storage medium having stored thereon program code instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-31.